ADAPTIVE PARTITIONING TECHNIQUES IN PERFORMING QUERY REQUESTS AND
REQUEST ROUTING

CROSS REFERENCE TO RELATED APPLICATIONS

The present application is related to the following ten
copending United States patent applications each filed on
Mar. 31, 1999, each having its assignee of the entire interest
in common with the assignee of the entire interest of the present
application, and having titles and serial numbers as follows:
TARGETED BANNER ADVERTISEMENTS, Ser. No. 09/282,764, now pending;
COMMON TERM OPTIMIZATION, Ser. No. 09/282,356, now pending;
GENERIC OBJECT FOR RAPID INTEGRATION OF DATA CHANGES,
Ser. No. 09/283,815, now pending;
EFFICIENT DATA TRANSFER MECHANISM FOR SYNCHRONIZATION OF
MULTI-MEDIA DATABASES, Ser. No. 09/283,816, now pending;
NEW ARCHITECTURE FOR ON-LINE QUERY TOOL, Ser. No. 09/283,837,
now pending;
DATA ENHANCEMENT TECHNIQUES, Ser. No. 09/282,342, now pending;
DATA MERGING TECHNIQUES, Ser. No. 09/282,295, now abandoned;
TECHNIQUES FOR PERFORMING INCREMENTAL DATA UPDATES,
Ser. No. 09/283,820, now pending;
WEIGHTED TERM RANKING FOR ON-LINE QUERY TOOL, Ser. No. 09/282,730,
now pending; and
HYBRID CATEGORY MAPPING FOR ON-LINE QUERY TOOL,
Ser. No. 09/283,268, now pending.

BACKGROUND OF THE INVENTION

This application generally relates to routing requests and
performing queries in a computer system. More particularly,
this application relates to performing adaptive partitioning
techniques in a computer system when performing user
queries and routing user requests.

Requests within a computer system, such as a distributed
system, may be assigned to particular server nodes in
accordance with load balancing techniques. These load
balancing techniques may include routing requests to a
particular node based on factors relating to the dynamic state
of a node, such as the current processing load and availability
of a server node. Additionally, static node characteristics
and capabilities may also be taken into account when routing
a particular user request. For example, the particular
processing speed of the CPU of a particular server node may be
a factor, as may be the availability of a particular type of
software or hardware.

Thus, there is required a technique for routing requests
within a computer system which takes into account
particularities of the various hardware and software in the computer
system while simultaneously providing an adaptive technique
in accordance with particular user queries to provide
for better utilization of computer resources.

SUMMARY OF THE INVENTION

In accordance with principles of the invention is a method
for performing data query caching in a computer system. A
data domain is partitioned into one or more partitions. One
or more of said partitions are associated with one or more
nodes in a computer system. A request for a data query is
classified as pertaining to a particular one of the partitions.
The request is routed to a node in the computer system in
accordance with the particular one of the partitions. Data
from a data query cache associated with the node is used in
performing a data query included in the request.

Thus, there is provided a technique for routing requests
within a computer system which takes into account
particularities of the various hardware and software in the computer
system while simultaneously providing an adaptive technique
in accordance with particular user queries to provide
for better utilization of computer resources.

BRIEF DESCRIPTION OF DRAWINGS

The above-mentioned and other features of the invention
will now become apparent by reference to the following
description taken in connection with the accompanying
drawings, in which:

FIG. 1 is an example of an embodiment of a system that
includes an on-line query tool;

FIG. 2 is an example of a block diagram of a hardware
view of an embodiment of the query tool;

FIG. 3 is an example of an embodiment of a user interface
displayed with an on-line query tool;

FIG. 4 is an example of a block diagram of a software
view of an online query tool of FIG. 2;

FIG. 5 is an example of an embodiment of a table
illustrating data storage for denaturalized objects in the
databases;

FIG. 6 is an example of an embodiment of a table
representing data stored in the generic object dictionary;

FIG. 7 is an example of an embodiment of a portion
of a PHTML execution tree;

FIG. 8 is an example of an embodiment showing more
detail of the parse driver;

FIGS. 9 and 10 are an example of a user interface
displayed in response to a user request [illegible];

FIG. 11 is an example of an embodiment of a user
interface displayed with user query information;

FIG. 12 is an example of the query results displayed in
response to performing a user query of FIG. 11;

FIG. 13 is an example of a user interface which includes
user-specified query information;

FIG. 14 is an example of a resulting display page in
response to the query performed with information specified
in FIG. 13;

FIG. 15 is a more detailed display in response to choosing
a particular category of FIG. 14;

FIGS. 16 and 17 are an example of a user interface
displayed in response to selecting an option from the menu
of FIG. 3 to add or change a listing;

FIG. 18 is an example of a display screen in response to
updating the business listing specified in FIGS. 16 and 17;

FIGS. 19 and 20 are an example of a user interface screen
[illegible, referring to FIG. 18];

FIG. 21 is an example of a screen display to a user with
more information with regard to the business listing selected
[illegible];

FIG. 22 [illegible] the business information displayed with regard
to business in FIG. 21;

FIG. 23 is an example of an embodiment of the processes
included in the request router of FIG. 22;

FIG. 24 is an example of a block diagram of an
embodiment of the BackOffice component;

FIG. 25 is an example of the flow process representing the
processing of normalized data to the various data forms
included in the Front End Server;
FIG. 26 is an example of normalized data as may be 
included in an embodiment of the invention; 

FIG. 27 is an example of a denormalized data form as may 
be included in an embodiment of the invention; 

FIG. 28 is a flowchart of an example of an embodiment 
of a method for performing request processing in the system 
of FIGS. 2 and 4; 

FIG. 29 is a flowchart of an example of an embodiment 
of the method steps for performing parser processing in the 
system of FIGS. 2 and 4; 

FIG. 30 is a flowchart of an example of a method with 
steps for performing query engine processing in the system 
of FIGS. 2 and 4; 

FIG. 31 is an example of a dependency graph as may be 
included in one embodiment of the invention for performing 
incremental update; 

FIG. 32 is an example of a flowchart of the method steps 
for performing different update techniques in accordance 
with the number of transactions; 

FIG. 33 is a flowchart of an example of method steps of 
one embodiment for performing data query cache lookup as 
used in performing a data query; 

FIG. 34 represents an example of applying the minimum 
cost derivation sequence as applied in the step of FIG. 33; 

FIG. 35 is a flowchart of an embodiment of a method with 
steps for forming a name and determining if the correspond- 
ing data set is located in the query cache; 

FIG. 36 is an example of an entity as stored in the data 
query cache; 

FIG. 37 is a flowchart of an embodiment of a method 
including steps for performing an additional total-city cache 
lookup; 

FIGS. 37 and 38 are flowcharts for a method in one 
embodiment for performing total-city and multi-city cache 
searches; 

FIG. 39 is an example of more details that may be 
included in an embodiment of the query engine; 

FIG. 40 is an example of an embodiment of method steps 
by which the information retrieval software may obtain 
results; 

FIG. 41 is a flow chart showing an example of an 
embodiment of method steps for obtaining results; 

FIG. 42 is a flow chart showing an example of method 
steps for classifying results for queries using common terms; 

FIG. 43 depicts an example of a user interface for an 
on-line query tool, including a screen for initiating a user 
query; 

FIG. 44 depicts an example of a user interface for an 
on-line query tool, including categories that may be 
retrieved in response to initiation of a user query; 

FIG. 45 is a block diagram of an embodiment of the 
database as may be included in the BackOffice component; 

FIGS. 46 through 52 are flowcharts depicting processing 
steps in a method of one embodiment for performing foreign 
source data integration; 

FIGS. 53 through 58 are flowcharts of a method of one 
embodiment for performing native source data integration 
processing; 

FIG. 59 is an example of an embodiment of data tables 
included on a sending node for a multi-media data transfer; 

FIG. 60 is an example of an embodiment of the tables as 
appearing on the sending side and the receiving side in the 
multi-media data transfer; 
FIG. 61 is an example of a representation of a tree 
structure representing the relationships between entities 
used in the multi-media transfer; 

FIG. 62 is a snapshot of the tables that may be included 
in a preferred embodiment in sending data in a multi-media 
data transfer; 

FIG. 63 is a snapshot of an example of an embodiment of 
the tables on the sending and receiving side at another point 
when performing a multi-media data transfer; 

FIG. 64 is an example of an embodiment of tables and 
external processes on the sending and receiving side using 
the multi-media data transfer; 

FIG. 65 is an example of an embodiment of the tables 
resulting from the text data integration; 

FIG. 66 is an example of a block diagram of an embodi- 
ment of the data table whose contents have been transferred 
to the receiving side; 

FIG. 67 is a flowchart of a method of the steps of one 
embodiment for assembling blob data into a repository table 
when performing a multi-media data transfer; 

FIG. 68 is a flow chart setting forth method steps for 
establishing super-category term lists and for matching 
advertisements to super-categories, to assist in targeting an 
advertisement to a user of an on-line query tool; 

FIG. 69 is a flow chart setting forth method steps for 
mapping categories to super-categories; 

FIG. 70 is a flow chart setting forth method steps for 
executing a modified query in an on-line query tool designed 
to assist in targeting an advertisement to a user of an on-line 
query tool; and 

FIG. 71 is a diagram showing an example of a linked 
super-category term list. 

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

Referring now to FIG. 1, shown is an embodiment of an 
on-line query tool 1910. In an embodiment, one or more 
users 1900-1904 may connect to the on-line query tool 1910 
via a network 1906. Users may interact with the query tool 
using conventional hardware and software, such as, in an 
embodiment, a web browser through the Internet. 

Referring now to FIG. 2, shown is an embodiment of a 
hardware view of an on-line query tool. In one embodiment, 
this on-line query tool may be the GTE Superpages SM query 
tool. FIG. 2 shows a hardware view of the components that 
may be included in one embodiment of the query tool in 
typical operation as being accessed by a user through a 
network. The user 800 enters a query request which is sent 
via a network 802, such as the Internet, to the GTE 
Superpages Front End Server 804. The GTE Superpages Front 
End Server 804 includes a hardware router 806 for receiving 
incoming query requests. The hardware router routes the 
request, using a simple hardware-based technique, to one of 
the server nodes 808-810 which may be designated to 
service the request by performing the requested query. The 
servers 808 through 810, server 1 through server n, 
respectively, interact with the Primary Database 812 and 
Secondary Database 814 to perform a data query. The 
Primary Database 812 interacts with the BackOffice 
component 818 at times, as will be described in paragraphs 
elsewhere herein, to obtain data used in performing the 
queries. The BackOffice component 818 performs data 
filtering and other processing, for example, to combine 
information that may be obtained from various data sets, 
producing a resultant data set. The resultant data set is subsequently 
transferred to the Primary Database for use by the various 
server nodes 808 through 810. 

The process of data integration and updating the data, for 
example, from the BackOffice to the Front End Server, may 
be performed at a time other than peak demand time. These 
processes and data transfer techniques, as will be described 
in following paragraphs, are generally performed "off-line" 
and not in response to user query requests. Rather, these 
techniques may be performed as part of a data maintenance 
and update process performed in accordance with the system 
load and the number and type of update transactions. 

FIG. 2 depicts a Superpages Front End Server 804 which 
includes a varying number of server nodes 808-810 to 
respond to the various query requests as made by a user 800. 
The techniques and concepts which are described in para- 
graphs that follow may be used in a variety of different 
systems which include one or more server systems. 
Additionally, a single database or other datastore may be 
used. The techniques described herein may generally be 
applied to a large distributed system. Additionally, these 
same concepts and techniques may be applied in a single 
user system performing data queries and searches upon a 
local database. 

Referring now to FIG. 3, shown is an example of a user 
interface screen as included in one embodiment of the 
system of FIG. 2. Generally, FIG. 3 is the initial screen 1800 
that may be displayed to a user entering a URL correspond- 
ing to the GTE Superpages Internet site. FIG. 3 includes 
fields for query information 1802-1808, hyperlinks to other 
tools 1810, such as on-line shopping or placing 
advertisements, and other links 1812, for performing other 
tasks such as modifying an existing business listing. 

The GTE Superpages Internet site is related to on-line 
yellow pages, similar to those included in a paper phone 
book. With these on-line yellow pages, various business 
services and user services may be provided. For example, a 
user may query the on-line yellow page information for 
various businesses in the United States based on particular 
search criteria. Online shopping information regarding prod- 
ucts and business services may be provided to a user 
performing a data query. Advertisers, such as the business 
providers of the various products and services, may also 
purchase advertisements similar to those that may be pur- 
chased in the paper copy of a phone book that includes 
yellow page listings of businesses. 

The interface 1800 may include links to various services 
and functions. For example, one service provided permits 
businesses to advertise in the on-line yellow pages. Func- 
tions associated with this service may include, for example, 
purchasing advertisements and adding or changing a busi- 
ness listing that an advertiser or business includes in the 
yellow pages. In FIG. 3, some of these functions are 
included in the interface portion 1812, with links to other 
tools in the screen portion 1810. A user may connect with 
any of these tools or functions to perform tasks related to the 
yellow pages advertising by selecting an option from the 
user interface 1800, such as by left-clicking with a mouse. 

Other interfaces with varying functions may be directed to 
a user. Other types of network connections in addition to the 
Internet may also be included in other embodiments and 
may vary with each application and embodiment. 

Referring now to FIG. 4, shown is an embodiment of the 
various software components for an on-line query system. 
One embodiment may be the on-line query tool of the GTE 
Superpages system. FIG. 4 depicts a software view of the 
typical operation of the system as being accessed by a user 
800 through a network 802 using the hardware as described 
in conjunction with FIG. 2. As previously described, the user 
may enter a request, as through a browser. This request is 
communicated through the GTE Superpages Front End 
Server 804 over the network 802. As shown in FIG. 4, the 
Front End Server 804 includes server node 808 that includes 
a web server engine 852. In one embodiment, the web server 
engine 852 is a Netscape™ engine which serves as a central 
coordinating task for accessing files and displaying 
information to the user on the browser 824. The server node 808 
also includes a request router 854, a monitor process 856 and 
a parser 866. The parser 866 generally includes a parse 
driver 858, a generic object dictionary 860, a query engine 
862, and a data manager 864. The parse driver 858 operates 
upon data from a constructed ad repository 842 and the 
PHTML files 844. Additionally, the parse driver 858 stores 
and retrieves data from the PHTML execution tree 846 and 
the page cache 848. The data manager 864 included in the 
parser 866 is responsible for interacting with the database, 
which in FIG. 4 is the Primary Database 812. It should 
also be noted that the data manager 864 may also obtain data 
from a Secondary Database as previously shown in FIG. 2. 
If there are multiple databases other than a Primary and 
Secondary Database, the data manager may also interact 
with these to obtain the necessary data upon which data 
queries are performed. The query engine 862 operates upon 
data from, and writes data to, the data query cache 850. 
Additionally, the query engine uses data from the term lists 
836 to obtain identifiers and possibly other retrievable data 
in accordance with various key terms upon which a data 
query is being performed. The request router 854 generally 
interacts with the parser and reads data from the 
configuration file 830 and load file 834. The monitor process 856 also 
reads data from and writes data to the load file 
834. The web server engine 852, in this embodiment the 
Netscape engine 852, obtains data from the HTML 
repository 838 and the image repository 840 in accordance with 
various requests from the browser for different types of files. 
Each of the foregoing components will be described in more 
detail in terms of function and operation in paragraphs that 
follow. 

The monitor process 856 is generally responsible for 
indicating the availability of server nodes 808-810 in 
performing data queries. The monitor is also generally 
responsible for receiving incoming messages from other server 
nodes as to their availability for servicing requests. 

The load file 834, which the monitor process 856 
reads and writes, is a dynamic file in that its contents are 
updated in response to incoming messages indicating 
machine availability and the current load of the 
corresponding machine. The load file also includes static information 
components, such as the maximum load of each system. 
Generally, the actual executing load (current load) of a 
system is less than or equal to the maximum load (max load) 
as indicated in accordance with the load file. Each server has 
its own unique copy of the load file which is updated in 
accordance with messages which it receives from the other 
nodes. Below is an example of an entry that may be included 
in the load file representing the information described above: 

SERVER, MAX LOAD, CURRENT LOAD. 
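
The load file entry above can be illustrated with a minimal sketch. It assumes a comma-separated text encoding and integer load units, neither of which is specified in this description; the names `LoadEntry`, `parse_load_file`, and `apply_load_message` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class LoadEntry:
    """One load-file row: SERVER, MAX LOAD, CURRENT LOAD."""
    server: str
    max_load: int      # static component: maximum load of the system
    current_load: int  # dynamic component: updated from incoming messages

    def has_capacity(self) -> bool:
        # Current load is expected to stay at or below the maximum load.
        return self.current_load < self.max_load


def parse_load_file(text: str) -> dict:
    """Parse comma-separated rows (an assumed encoding) into a table
    keyed by server name."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        server, max_load, current_load = (f.strip() for f in line.split(","))
        table[server] = LoadEntry(server, int(max_load), int(current_load))
    return table


def apply_load_message(table: dict, server: str, current_load: int) -> None:
    """Record a load report received from another node, clamped so that
    the stored current load never exceeds the maximum load."""
    entry = table[server]
    entry.current_load = min(current_load, entry.max_load)


# Each server node would keep its own copy of such a table.
load_table = parse_load_file("server1, 100, 40\nserver2, 80, 80\n")
apply_load_message(load_table, "server1", 55)
```

Keeping the static maximum and the dynamic current load in one record mirrors the entry layout described above, so a router can check remaining capacity with a single lookup.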

The configuration file 830 may be a static file physically 
located on one of the server nodes 808-810 with a copy 
replicated on each other server node. Generally, this file is 
created prior to use of the system. It may specify which 
servers may service requests based on weighted parameters 
of a particular search domain associated with a particular 
server. Below is an example of an entry in a configuration 
file: 
DOMAIN/PARTITION, SERVER, DOMAIN WEIGHT, SERVER WEIGHT

A domain weight may be a normalized value representing
the cost associated with processing requests for
this associated search domain or partition. This domain
weight is based on the median time to service a request in
that domain based on the analysis of past data logs, for
example as normalized by the number of listings in the
domain. Similarly, server weights may represent the cost
associated with processing a request on a particular server.
The domain/partition indicates a portion of the search
domain upon which a user query may be performed that is
associated with a particular server.

Other particular embodiments of the load and configuration
files may include additional or different information in
accordance with the particular policies and data required to
implement the policies, such as request routing.

In this particular embodiment, an incoming request may
be processed by one of a plurality of parsers 858 on each of
the server nodes. The parser 858 generally transforms the
user input query into a form used by other components, such
as the request router. The request router generally receives
an incoming request as forwarded by the hardware router
806 of FIG. 2. The request router subsequently uses the load
file and the configuration file to decide which server node
808-810 a request is routed to based on the load and the
availability of the server node, and the designated server for
each partition or domain. Once a request is routed to one of
the server nodes 808-810, the query is performed producing
data query information that may be cached, for example, in
the memory of a data query cache 850.

One use of the data query cache 850, as will be described
in paragraphs that follow, is its use in improving the
performance in response to a user request in a subsequent query
that may use a subset or superset of the data stored in the
data query cache 850. A superset or composition query is
one which is a boolean composite of several querying terms.
A composition query may be determined [illegible]
and the request router 854 may decide to which server node
808-810 the composition query or other query is sent for
processing in accordance with domain weights as indicated
in the configuration file. Reallocation of requests when a
server is unavailable may be performed generally with a bias
toward the initial allocation scheme as indicated also by the
configuration file. There is an assumption that reallocation of
a request is on a transient basis, and that the initial allocation
scheme is the one to be maintained. This concept will be
described in paragraphs that follow in accordance with
request routing and data query caching.

Also shown in FIG. 4 are the PHTML execution tree 846,
the page cache 848, and the PHTML file store 844.
Generally, the PHTML execution tree 846 includes an
expanded version of a PHTML file requested from the
PHTML file 844 as the result, for example, of a user query.
PHTML generally is a modified version of the HTML
language, which is a markup language according to the
Standardized General Markup Language (SGML) standard,
capable of interpretation by browsers, such as a Netscape
browser. PHTML generally is a scripted version of HTML
with conditional statements that provide for alternate
inclusion of blocks of HTML code in a resulting HTML page
transmitted to a browser in accordance with certain run time
query conditions. The expanded version of a PHTML file
may be described as a parse tree representing parsed and
expanded PHTML files. For example, if a PHTML file
conditionally includes accesses to other PHTML files or
various portions of HTML commands, the parse tree
structure reflects this in its representation of the parse tree which
is cached in the PHTML execution tree 846. Upon a
subsequent request, the expanded version is retrieved
from the PHTML execution tree 846 to increase system
efficiency by decreasing user response time for the
subsequent query.

The first time a user makes a request through the browser,
a request is received by the webserver engine 852 which
interacts with the parser. For a particular request,
a PHTML file is obtained and executed from the PHTML file
store 844. The expanded version of the PHTML file is
cached in the PHTML execution tree 846. In response to a
user's request, an HTML page is generally constructed and
cached in the page cache 848. Generally, constructed HTML
pages are stored in the page cache 848 if the amount of time
taken to produce the resulting HTML page is greater than a
predetermined threshold. Implementations of the page cache
may implement different replacement schemes. In one
preferred embodiment, the page cache implements an LRU
replacement scheme. Additionally, the threshold, the amount
of time used to determine which pages are stored in the page
cache, may vary with system and response time
requirements.

When processing an incoming user request which results
in returning an HTML page to a user, a particular search
order of the previously described caches and file systems
may be performed. Initially, it is determined whether the
HTML page to be displayed to the user is located in the page
cache 848. If not, search results are obtained from the query
cache and the resulting HTML page is constructed and itself
may be placed in the page cache 848. If a PHTML file is
required to be executed in constructing the resulting HTML
file, the PHTML execution tree 846 may be accessed to
determine if there is a parsed version of the required
PHTML file already expanded in the PHTML execution tree.
If no such file is located in the PHTML execution tree 846,
the PHTML file 844 is accessed to obtain the required
PHTML file. The order in which these caches and file
systems are searched is generally in accordance with a
graduated processing state of producing the resulting HTML
file. Caches associated with a later state of processing are
generally searched prior to ones associated with an earlier
processing state in producing the resulting HTML file.

Also accessed by the parse driver 858 is a constructed ad
repository 842. As will be described in paragraphs that
follow, the constructed ad repository generally includes
constructed advertisement pages which may include, for
example, text and non-text data, such as audio and graphic
images to be displayed in response to a user query which
represent, for example, a yellow pages ad. The webserver
engine 852 accesses information from the image repository
840 and HTML repository 838. Generally, the image
repository 840 includes various graphic images and other non-text
data which may also be directly accessed by the webserver
engine 852 in response to a user request, as by a user request
for a specific URL. Similarly, the HTML repository 838
includes various HTML files which may be provided to the
user, for example, in response to a user request with a
specific URL which indicates a file.

Included in each of the server nodes 808-810 are one or
more parsers 866 which perform, for example, parsing of the
text of a user data query request. FIG. 4 includes some of the
software components as included in the parser 866. The
components of the parser 866, which are described in more
detail in the following paragraphs, generally communicate
using a generic object dictionary 860. The parser may
include a parse driver 858 which performs the actual parsing 
of a user query. The parse driver 858 interacts with the query 
engine 862 once a request has been parsed to formulate a 
data query which is further passed to the data manager 864. 
As previously described, the data manager 864 generally 
interacts with a database to actually retrieve the data to be 
included in the resultant data query as displayed to the user. 
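
The request routing described earlier, in which the request router consults the configuration file and the load file, can be sketched roughly as follows. This is a sketch under stated assumptions, not the routing policy of the embodiment: the cost model (domain weight multiplied by server weight), the capacity check, and the names `ConfigEntry`, `Load`, and `route_request` are all hypothetical. The sketch only illustrates preferring the designated server for a partition and falling back transiently when that server is unavailable or full.

```python
from dataclasses import dataclass


@dataclass
class ConfigEntry:
    """One configuration row: DOMAIN/PARTITION, SERVER, DOMAIN WEIGHT,
    SERVER WEIGHT."""
    partition: str
    server: str          # designated server for this partition
    domain_weight: float
    server_weight: float


@dataclass
class Load:
    max_load: float
    current_load: float
    available: bool = True


def route_request(partition, config, loads):
    """Return the server for a request classified into `partition`.

    The designated server is preferred. If it cannot take the request,
    another available server is chosen, but `config` is left unchanged,
    so later requests return to the initial allocation."""
    entry = config[partition]
    server_weight = {e.server: e.server_weight for e in config.values()}

    def projected(server):
        # Assumed cost model: domain weight scaled by the server's weight.
        cost = entry.domain_weight * server_weight.get(server, 1.0)
        return loads[server].current_load + cost

    def can_take(server):
        load = loads[server]
        return load.available and projected(server) <= load.max_load

    if can_take(entry.server):
        return entry.server
    candidates = [s for s in loads if can_take(s)]
    if not candidates:
        return None
    # Transient reallocation: pick the least relatively loaded server.
    return min(candidates, key=lambda s: projected(s) / loads[s].max_load)


config = {
    "boston": ConfigEntry("boston", "server1", 0.5, 1.0),
    "dallas": ConfigEntry("dallas", "server2", 0.8, 2.0),
}
loads = {"server1": Load(10, 3), "server2": Load(10, 9)}

first = route_request("boston", config, loads)       # designated server
overflow = route_request("dallas", config, loads)    # server2 is full
loads["server2"].current_load = 2
restored = route_request("dallas", config, loads)    # back to server2
```

Because the fallback never rewrites the configuration, a partition's requests return to its designated server as soon as that server has capacity again, which keeps the data query cache on that node useful for the partition.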

The parse driver 858 generally uses a data schema 
description to interpret various data fields of the generic data 
objects. Generally, abstraction of the data interpretation into 
the data schema description enables different components of 
the parser 866 to operate upon and use generic data objects 
without requiring code changes or recompilation of these 
components when new data presentation types are 
introduced. Components which need to know the 
details of the generic data object, such as the parse driver 
858, to perform certain functions, do this on a per- 
component basis using data schema descriptions to interpret 
a generic data object. This technique insulates code as 
included in the parser 866 from the introduction of new 
presentation types which may be represented as generic data 
objects. 
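
As a rough illustration of the idea above, a generic data object can be modeled as an untyped tuple of values, with a separate data schema description saying how to read each field. The schema layout, the type names, and the identifiers `SCHEMAS`, `GenericObject`, and `interpret` are assumptions made for this sketch; the description does not specify the actual representation.

```python
from dataclasses import dataclass

# Hypothetical data schema descriptions: for each presentation type,
# an ordered list of (field name, converter) pairs.
SCHEMAS = {
    "listing": [("name", str), ("phone", str), ("city", str)],
    "ad": [("headline", str), ("image_id", int)],
}


@dataclass
class GenericObject:
    """A generic data object: a type tag plus positional raw values.
    Components that merely pass it along never inspect `values`."""
    kind: str
    values: tuple


def interpret(obj, schemas=SCHEMAS):
    """Interpret a generic object's fields using its schema description."""
    fields = schemas[obj.kind]
    if len(fields) != len(obj.values):
        raise ValueError("object does not match its schema description")
    return {name: convert(value)
            for (name, convert), value in zip(fields, obj.values)}


listing = interpret(GenericObject("listing",
                                  ("Acme Shoes", "617-555-0100", "Boston")))

# Introducing a new presentation type only adds a schema description;
# interpret() and the code that passes objects around are unchanged.
SCHEMAS["coupon"] = [("code", str), ("percent_off", int)]
coupon = interpret(GenericObject("coupon", ("SAVE10", "10")))
```

Only the component that calls `interpret` depends on the schema table, which is the kind of insulation from new presentation types that the paragraph above describes.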

One common use of the GTE Superpages Internet site is 
to perform a data query. In performing a data query, a user 
enters data query information, as in fields 1802-1808 of 
FIG. 3, or may select other detailed search options, such as 
searching by distance, as included in field 1808. In this 
embodiment, data field 1802 is a category query field by 
which queries may be performed in accordance with speci- 
fied search categories that may be associated with business 
listings included in the yellow pages database. Additionally, 
field 1802 also includes predetermined top categories, as 
may be determined by examining log files in accordance 
with user query selections and search criteria. In this 
embodiment, selection of the "top categories" of the field 
1802, as by left-clicking with a mouse button, causes the 
interface 1820 of FIG. 9 to be displayed in a user's browser. 

Referring now to FIGS. 9 and 10, shown is one 
ment of a user interface for displaying a first page of the top 
query categories 1820. Generally, these categories are asso- 
ciated with the various business listings and are tags by 
which a user may perform queries. In this embodiment, for 
example, the user may select the "top categories" from the 
initial interface as included in the field 1802. 

Referring now to FIG. 11, shown is one embodiment of a 
user interface for displaying a "search by distance" option. 
In this embodiment, this user interface screen may be 
displayed by selecting "detailed search" from the field 1808 
from the initial user interface 1800. For example, the user 
interface 1830 may be displayed if the user wants to perform 
a data query for specified categories and certain distance 
criteria. As shown in the example of user interface 1830, a 
data query may be performed for restaurants within five (5) 
miles of Boston, Mass. This query is performed when the 
user selects the "Find It" button 1832 as included in the user 
interface 1830. In this embodiment, a first screen 1840 of the 
data query results is shown in FIG. 12. 

Referring now to FIG. 13, shown is an example of one 
embodiment of a user interface display 1850 for performing 
a user query in accordance with user-specified search crite- 
ria. User interface 1850 of FIG. 13 is the interface 1800 of 
FIG. 3, but with user-specified data query information 
included in various data fields. In FIG. 13, a data query is 
performed for "shoes" as the category 1802 for "Boston, 
Mass." in field 1804. The query is performed by selecting 
the "Find It" button of field 1806. The resulting screen 
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displayed in response to selection of the "Find It" button is 
included in FIG. 14. 

Referring to FIG. 14, shown is one example of a screen
display in response to performing a user query. The screen
results 1860 may include displayed summarized business
listing information in accordance with the search criteria
previously specified in FIG. 13. Various business listings
may be grouped together in categories. In this example,
relating to "shoes", there are 154 business listings included in
thirteen (13) categories. From this listing of thirteen (13) 
categories, the user may select one of these relating to shoes. 
For example, selection, as by using a mouse, of "custom 
made shoes" 1862 results in the screen display of FIG. 15. 

Referring now to FIG. 15, shown are the business listings 
relating to the user-specified search criteria selection relating 
to "custom made shoes". From this screen 1870, the user 
may further select one of the businesses for more informa- 
tion pertaining to the business, such as directions and 
business-provided advertisements. 

Referring now to FIGS. 16 and 17, shown is one embodi- 
ment of a user interface that may be displayed when a 
business or advertiser updates a business listing. This screen 
may be displayed, for example, by selection of the "add or 
change your listing" option 1812 of FIG. 3 of the initial user 
interface. A user interface 1880 provides data fields which 
allow a user to enter in information, such as a telephone 
number corresponding to a business listing. Corresponding 
business listing information is then updated. In this example, 
a phone number 617-832-5000 is entered into field 1882 to
retrieve business listing information corresponding to this 
phone number. By selecting the phone number field that is 
filled in with this phone number, the resulting screen of FIG. 
18 is subsequently displayed to the user in this embodiment. 
The phone number corresponds to a business as displayed in 
FIG. 18. If this is the correct business, a user may select a 
displayed business, for example, by clicking on the "match- 
ing business" information of FIG. 18. In response to select- 
ing the "matching business" information, the screen display 
of FIGS. 19 and 20 may be displayed to a user. To update the 
basic listing information associated with the business, selec- 
tion of field 1890 of FIG. 20 results in display of the screen 
of FIG. 21 where the user has the option to either update the 
business information or change categories. If business infor- 
mation is selected, FIG. 22 may be displayed. FIG. 22 
includes the business listing information that may be 
updated, such as a street address or e-mail address associated 
with this business listing. 

Referring back to FIG. 16, a section of the displayed 
interface 1883 indicates options for creating a website linked 
to a particular business listing. Note also that in some 
embodiments, it is possible to enhance a business listing 
and/or link a listing to a pre-existing website or to one that 
is created. 

The foregoing user interfaces and display results may 
vary with embodiments and user-specified search criteria. 
Various other user interfaces and other techniques known to 
those of ordinary skill in the art for specifying user search 
criteria may be used in other embodiments of the invention. 

Referring to FIG. 23, shown is an embodiment of the 
request router 854. In this particular embodiment, the 
request router 854 may be executed within a Netscape server 
process space and may be invoked when a user, via a 
browser, makes a request which results in a PHTML file 
being executed. The PHTML files, as generally included in 
the PHTML file store 844, are in the form of a script 
activated when a server node 808-810 is forwarded a user 
request.

The request router 854 is generally responsible for routing
a request to the proper server node in accordance with data
stored in the configuration and load files. The request is also
forwarded to one of the plurality of parsers for processing
once the proper server node has been located. In this
embodiment, the request router 854 may include several
threads of execution as shown in FIG. 23, which operate
under the control of, and in the same process space as, the
Netscape browser. As shown in FIG. 23, the request router
854 generally includes a housekeeping thread 880, a router
thread 882, and one or more worker threads 884. Generally,
the housekeeping thread 880 is responsible for maintaining
a parser status table 886 and a parser queue 888, both of
which are further described below.

The router thread 882 generally responds to the monitor
process changes as recorded in the various data files with
regard to server node availability. The router thread 882
reads data from the configuration and load files, and maintains
an in-memory copy for use by the various threads of
the request router 854. The router thread 882 updates the
in-memory copy of the configuration and load files in
accordance with predetermined node fail-over and
reallocation-of-request policies. For example, if in reading
the configuration and load files, the router thread 882
determines that a first server node is at maximum utilization,
the router thread updates its in-memory, server-node, local
version of the files. The router thread determines not to
forward requests to the first server. When the first server
node's actual utilization decreases and the node is again
available for processing additional requests, the router thread
accordingly updates its in-memory copy.

Each of the worker threads 884 is initially forwarded a
request which arrives at a server node. The worker thread
884 makes the decision whether the request should be routed
to another node. The worker thread 884 makes this decision
generally in accordance with the contents of the configuration
and load files as previously described. If a request is
determined to be routed to another server, the worker thread
forwards the request to another worker thread on another
server node. If the worker thread does not forward the
request to another server, the worker thread determines
which parser to send the request to for further processing.
The list of available parsers is stored in the parser queue 888,
which in this particular embodiment is implemented as an
AT&T System 5™ with a system message queue. The parser
queue is generally maintained by the housekeeping thread
880.

It should be noted that the Netscape™ or other HTTP
server provides as a service the dispatching of requests to the
various worker threads. Other implementations may provide
this function using other techniques such as callback mechanisms
which dispatch the user requests to one of the plurality
of available worker threads 884. Generally, the parser status
table 886 includes information about use, availability and
location of each of the plurality of parsers on each server
node. The parser status information may be used in determining
where to route requests, for example, as performed by
the worker thread 884. The parser status information as
included in the parser status table 886 may be used to route
requests based on an adaptive technique similar to the
adaptive caching technique which will be described in
paragraphs that follow. This may be particularly useful in
systems with multiple processors, for example, those in which
certain CPUs are dedicated processors associated with predetermined
parsers. For example, as particular requests are
processed by particular parsers, each associated with a
particular CPU, the parsing results may be stored in the
PHTML execution tree accessed by the particular processor.
Subsequent requests which are also processed by the same
parser may access the cached parsing results stored in the
PHTML execution tree.

In this particular embodiment, the request processing
model includes a plurality of parsers and a plurality of
worker threads. Using this request processing model, an
incoming request is associated with a particular worker
thread which then forwards the request to a parser for
processing. Once this request has been associated or forwarded
to a particular parser, the worker thread is disassociated
with the request, and is then available for use in the
pool of worker threads. The number of parsers and worker
threads may be tuned in accordance with the number of user
requests. One point to note using this model is that the
worker threads and parsers are disassociated and thought
of as independent processing units rather than as a unit in which
a worker thread is associated with a particular parser for
processing an entire life of a request.

Referring now to FIG. 24, shown is a block diagram of an
embodiment of the BackOffice component 818. Generally,
the BackOffice component includes a database 892 which
provides data, for example, to the Front End Server 804
through connection 822. The database 892, as stored in the
BackOffice component, may be updated, as through a webserver
via a connection to a user. Such a connection as 896
may be used, for example, when a modification is made to
an entry to correct a typographical error. A user may connect,
such as via a browser, using connection 896, to the webserver
894 included in the BackOffice component. The
database 892 is then accessed and updated in accordance
with requests or updates made by the user.

Other embodiments of the BackOffice component may
include other software components than those displayed in
FIG. 24. Additionally, a user may update entries included in
database 892 using techniques other than by a connection
896 via a webserver to the database 892. As described in
other sections of this description, different types of updates
to database 892 may be performed in different embodiments
of the invention. For example, the database 892 may be
updated on a per-entry basis by a variety of users connecting
via multiple webserver connections. Additionally, periodic
updates, for example, for a particular data set may be provided
from a particular vendor, and accordingly integrated into
database 892 through a database integration technique rather
than having a user manually enter these updates such as via
a connection to the webserver 894.

The connection to the Front End Server 822 may be used,
for example, to load a new copy of the database 892 into the
Front End Server Primary and Secondary Databases 812,
814 as shown in FIG. 2. The way in which these updates may
be sent across the connection 822 to the Front End Server
may be as previously described in terms of database operational
commands which perform updates from the computer
system which includes database 892. For example, in one
embodiment, the database 892 included in the BackOffice
component and both the Primary and Secondary Databases,
as included in FIG. 24, are Oracle™ databases. Oracle
provides remote database update and access commands
which allow for remote database access and updating, such
as update requests from the database server node 892 to
update the Primary Database 812 as stored in the Front End
Server 804. In this embodiment, updates as made to the
database 892 are "pushed" to the Front End Server 804 via
the connection 822. These modifications are pushed via
database-provided update techniques such as those included
when sending the operational table commands to the Front
End Server 804.

In this particular embodiment, when information is sent
via connection 822 to the Front End Server 804 from the
BackOffice component 818, error messages and other status
codes may be sent back to the BackOffice component 818 in
accordance with an indication as to whether a data transfer,
for example, has been successfully completed.
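
Returning to the request router 854 of FIG. 23, the cooperation between the router thread 882 and the worker threads 884 may be illustrated by the following minimal Python sketch. It is an illustration under stated assumptions rather than the embodiment itself: the names NodeStatus, RequestRouter, refresh, route and release are hypothetical, the load file is assumed to reduce to one utilization figure per node, and each node is assumed to be assigned a set of request classes.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class NodeStatus:
    """One server node's entry in the in-memory copy (hypothetical fields)."""
    name: str
    utilization: float                      # assumed range 0.0 .. 1.0
    request_classes: set = field(default_factory=set)


class RequestRouter:
    """Sketch of the routing decision; not the patented implementation."""

    def __init__(self, local_node, nodes, parsers, max_utilization=1.0):
        self.local_node = local_node
        self.nodes = {n.name: n for n in nodes}   # router thread's in-memory copy
        self.parser_queue = deque(parsers)         # stands in for parser queue 888
        self.max_utilization = max_utilization

    def refresh(self, name, utilization):
        """Router-thread step: record a node's utilization as re-read from the load file."""
        self.nodes[name].utilization = utilization

    def available(self, name):
        """A node at or above maximum utilization is not sent further requests."""
        return self.nodes[name].utilization < self.max_utilization

    def route(self, request_class):
        """Worker-thread step: forward to another node, or hand off to a local parser."""
        candidates = [
            n.name for n in self.nodes.values()
            if request_class in n.request_classes and self.available(n.name)
        ]
        if not candidates:
            # Assumed fail-over policy: any available node, local node first.
            candidates = sorted(
                (name for name in self.nodes if self.available(name)),
                key=lambda name: (name != self.local_node, name),
            )
        if not candidates:
            raise RuntimeError("no server node available")
        target = candidates[0]
        if target != self.local_node:
            return ("forward", target)
        if not self.parser_queue:
            raise RuntimeError("no parser available")
        return ("parse", self.parser_queue.popleft())

    def release(self, parser):
        """Return a parser to the queue once it has finished a request."""
        self.parser_queue.append(parser)
```

In this sketch the worker thread holds no state for a request after route returns, which mirrors the model in which worker threads and parsers are treated as independent processing units.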

Referring now to FIG. 25, shown is an embodiment of a 
general process by which data that is transferred from the 
BackOffice 818 to the Front End Server 804 is further 
integrated into other data stores within the Front End Server 
804. Data is stored in the BackOffice component in this 
particular embodiment in a normalized data form, as will be
further described in paragraphs that follow. These normalized
data changes are transferred to the Front End Server 804
from the BackOffice component in one of several forms. For 
example, the entire database may be transferred to the Front 
End Server 804. Additionally, changes or updates to par- 
ticular entries may also be transmitted to the Front End 
Server 804 from the BackOffice component rather than 
updating or overwriting the entire copy of the database as 
stored in the Front End Server 804. Each of these types of 
database updates from the BackOffice component to the 
Front End Server 804 may be done in accordance with the 
number of transactions or updates to be performed. This is 
further described in other sections of this description. 

Data which is stored in the Front End Server 804 may be 
stored in a normalized data format 900. Extraction routines 
902 operate upon this normalized data to produce denormalized
data 904 and markup language files 906. The
markup language files 906 serve as input to information 
retrieval software 908 which outputs term lists 836. As 
known to those skilled in the art, a markup language file 
generally includes tags which represent commands or text 
identifiers for processing the contents of the file. For 
example, Standard Generalized Markup Language,
SGML, is a standards-based markup language known to those
skilled in the art.

The process depicted in FIG. 25 is performed once data 
has been received in the Primary Database 812, and is first 
stored in the Primary Database 812 in normalized data form, 
as in the normalized data store 900. Extraction routines 902 
examine the normalized data store 900 and rearrange the 
information to place it in the denormalized data form, also 
included in the Primary Database 812 of this embodiment. 
These changes or updates for the normalized data which are 
transformed into the denormalized data form are integrated 
into the denormalized data store 904. Additionally, the 
extraction routines 902 produce markup language files 906 
which are primarily used by the information retrieval soft- 
ware to produce identifiers and corresponding words or 
terms upon which a query may be performed. These lists of 
key words or terms which may be searchable or retrievable 
and the corresponding record identifiers as included in the 
denormalized data store 904 may be stored in a list structure 
as included in the term list data store 836. 
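
By way of illustration only, the work of the extraction routines 902 may be sketched in Python as follows. The sketch assumes that the normalized data store 900 can be modeled as one dictionary per relation, each keyed by business ID, and that a markup language file 906 is a small SGML-like document per business. The function names denormalize and to_markup and the tag names are hypothetical and do not appear in the embodiment.

```python
from html import escape


def denormalize(business_id, addresses, emails, phones):
    """Join the per-relation tables (each keyed by business ID) into one record."""
    return {
        "id": business_id,
        "address": addresses.get(business_id, ""),
        "email": emails.get(business_id, ""),
        "phone": phones.get(business_id, ""),
    }


def to_markup(record):
    """Render one SGML-like document for a business; tag names are illustrative."""
    lines = ['<business id="%s">' % record["id"]]
    for tag in ("address", "email", "phone"):
        # Escape markup-significant characters so field data cannot break the tags.
        lines.append("  <%s>%s</%s>" % (tag, escape(record[tag]), tag))
    lines.append("</business>")
    return "\n".join(lines)
```

For example, calling denormalize(7, addresses, emails, phones) and passing the result to to_markup yields one document whose tagged fields could then be handed to the information retrieval software.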

Generally, the markup language files include one file or 
document per business for which there is an advertisement, 
for example, in this particular embodiment. Each of the
markup language files 906 includes markup language 
statements, such as SGML-like statements, with tags iden- 
tifying key data items in the document for each business. In 
this particular embodiment, the information retrieval soft- 
ware is Verity software which uses as input markup language 
files 906. Additionally, Verity uses its own schema file by 
which a user indicates what key words or terms as indicated 
in the markup language files are searchable and which of the 
data fields contain retrievable information. "Searchable" as
used herein means fields or key words and terms upon which
searches may be performed, like index searching keys.
"Retrievable" as used herein generally means fields or 
categories with associated data that may be retrieved. All 
searchable fields have a tag, such as a business name or city. 
Identifiers are generally produced by the information 
retrieval software 908. Verity™, in this particular 
embodiment, produces term lists 836 in which there exists a 
list for each particular key word, term or category followed 
by a chain of identifiers that indicate the record number in 
the denormalized data store 904. Additionally, associated 
with each element in the term list which indicates a record 
in the denormalized data, retrievable data associated with 
that record may also be included. For example, if the field 
"zip code" includes a tag as included in the mark-up 
language file 906 which indicates that this particular field is 
searchable, it may be desired that whenever a user wishes to 
do a search for "zip code" what is actually retrieved or 
displayed to the user is the city and the state. Accordingly, 
in this instance, the term list and the term list data store 836 
contain a list corresponding to the key word "zip code". 
There is a term list for each particular value of a zip code. 
Attached to each key word "zip code" and the particular 
value may be a list or a chain of identifiers. Associated with 
each identifier on the chain may be associated data, such as 
the city and state, which may be retrieved when a particular 
zip code is searched. 
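
The term list structure described above may be sketched, under stated assumptions, as an inverted index. In the following Python sketch, the SCHEMA dictionary, the field names and the functions build_term_lists and search are hypothetical stand-ins. They do not reproduce the Verity schema file format or any Verity interface, and only illustrate one list per key word and value, each holding a chain of record identifiers with associated retrievable data.

```python
from collections import defaultdict

# Hypothetical schema: which fields are searchable, and which retrievable
# fields are attached to each identifier on a searchable field's chain.
SCHEMA = {
    "searchable": ["zip_code", "category"],
    "retrievable": {"zip_code": ["city", "state"]},
}


def build_term_lists(records, schema):
    """Build one list per (field, value) pair: a chain of (record ID, retrievable data)."""
    term_lists = defaultdict(list)
    for record in records:
        for field_name in schema["searchable"]:
            value = record.get(field_name)
            if value is None:
                continue
            extra_fields = schema["retrievable"].get(field_name, [])
            extra = {name: record[name] for name in extra_fields if name in record}
            term_lists[(field_name, value)].append((record["id"], extra))
    return dict(term_lists)


def search(term_lists, field_name, value):
    """Return the chain of identifiers and retrievable data for one term value."""
    return term_lists.get((field_name, value), [])
```

With this layout, a search on a zip code value returns the city and state directly from the term list, without a further lookup in the denormalized data store.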

Other types of data may also be included in other pre- 
ferred embodiments of the term lists. For example, the data 
included in the term lists may be data that is also needed in 
performing search optimizations, weighted searches, or dif- 
ferent types of searches, such as proximity searches. This 
data may further be stored in the various data files and 
caches of the Front End Server as needed in accordance with 
each implementation, for example in accordance with the 
types of searches and data upon which queries may be 
performed or otherwise operated upon by the Front End 
Server. 

Referring now to FIG. 26, shown is a detailed description 
of one embodiment of an example of normalized data, as
may be stored in the BackOffice component and one copy in 
the Primary Database 812. Generally, in the Primary and 
Secondary Databases 812 and 814, respectively, of FIG. 2, 
the Primary Database 812 includes both normalized and 
denormalized data form, and the Secondary Database 814 
includes only denormalized data form. Normalized data is 
that representation of the data in which each data relation is 
represented independent of other relations. Generally, denormalized
data is the antithesis of normalized data, in that
one data relation represents all relations. Different databases
may be of different degrees of normalized and denormalized 
data. The BackOffice component 818 generally stores the 
data in normalized data form of a certain degree. Similarly, 
the databases used in this server store the data in a
normalized form, also of a certain degree, and additionally
in a denormalized form for search performance optimizations
in performing data queries. In one embodiment, for
example, the data is stored in third normal form.
Additionally, in the denormalized form, sets of data may be
stored together within a single field, such as multiple mailing
addresses. Other embodiments may have one field per 
address. This may prove to be advantageous, for example, 
for high performance and better flexibility in systems subject 
to multiple and diverse data sources, and a high rate of 
modifications. 

As shown in FIG. 26, for example, each particular business
entry may have a unique identifier, (ID). Additionally,
three pieces of information may be stored for each particular
business. The normalized data form may look as in FIG. 26.
In this particular example, there may be a separate table for
each ID corresponding to a business and its business address
910. Additionally, there may be two other data tables of
information also indexed by each particular business ID.
Generally, as indicated in FIG. 26, the normalized data
representation for each business associated with a particular
ID is represented as a separate data relation independent of
the other relations.

The conceptual opposite of normalized data is denormalized
data, as depicted in FIG. 27. Referring now to FIG. 27,
shown is an example of denormalized data stored in table
916. In this example of denormalized data, for each ID
associated with a business, the business address, email and
telephone number may be stored in a single record. In other
words, one data relation, which is a single record in the table
916, represents all relations for one particular data set, such
as the ID corresponding to a business. Various degrees of
denormalized and normalized data, as known to those skilled in
the art, may be used. The optimal degree of normalized and
denormalized data forms may vary with each particular
implementation and embodiment.

Referring back to FIG. 24, it may generally be noted that
the BackOffice component 818 may include one or more
database servers 892. A user may directly interact with the
web server 894 included in the BackOffice component via
connection 896 which, for example, may be a network
connection of a user accessing the web server through the
Internet. The user may also interact directly with the BackOffice
component through the Front End Server Connection
822.

In this embodiment, the particular type and number of
data fields may vary with embodiment. Additional structure
may also be imparted to data fields, such as a telephone
number may include an area code and exchange component.
Additionally, interactions between the Primary Database
812 of the Front End Server 822 and the BackOffice component
may be driven or controlled by the BackOffice
component. For example, when there is an update to be
performed to the Primary Database server 820, an automatic
transfer of the new information may be transmitted to the
Primary Database 812 by the BackOffice component. Data
may be transmitted to the Primary Database 812 using
connection 822. Additionally, connection 822 may be used
to provide feedback or status information to the BackOffice
component 818, for example, regarding success or failure of
a data transfer using connection 822.

As generally described, the PHTML files 844 of FIG. 4
are generally HTML instructions as interpreted generally by
a browser with additional embedded processing instructions.
Generally, the PHTML execution tree 846 may be implemented
as a C++ applet class with various execute methods
which are conditionally performed based upon the evaluation
of certain conditions as indicated in the PHTML scripting
language statements. Each of the PHTML files 844 may
be expanded and evaluated in accordance with the particular
conditions of the user request. The first time a PHTML file
is accessed, it is expanded and the expanded version is
placed in the PHTML execution tree 846 of FIG. 4. Subsequent
accesses to the same PHTML file result in the conditional
evaluation of the stored and expanded PHTML file
in accordance with the run time performance and evaluation
of a user request, as from browser 824.

An HTML page is generally formed and displayed to the
user. For example, the HTML page may be formed by the
parser after interaction with the data manager and query
engine to select a specific number of items to be displayed
to the user. The HTML page may be stored in the page cache
848. The page cache generally includes a naming convention
such as a file system in which the name of the file corresponds
to the parameters of the query. [...]

The query engine 862 is generally responsible for performing
any required sorting of the query information or
subsetting and supersetting of information. Generally, the
query engine 862 retrieves various identifiers which act as
keys into the Primary Database 812 or Secondary Database
814 for accessing particular pieces of information in
response to a query. After the query engine 862 formulates
or retrieves various identifiers, for example as
from the term lists, which correspond to a particular user
query, this query information in the form of term list and
retrieved information may be stored in the data query cache
850. A technique similar to the page cache query-to-filename
mapping technique may be used to map a particular query
request to a naming scheme by which data is accessed in the
data query cache. The technique for forming this name is
described in other portions of this application.
Additionally, data which is stored in the data query cache
850 may be compressed or stored in a particular format
which facilitates easy retrieval as well as attempting to
optimize storage of the various data queries which are
cached, as discussed in other portions of this application.

In the following FIGS. 28-30, shown are flowcharts of
method steps of embodiments for performing processing in
various components of the previously described system of
FIGS. 2 and 4.

Referring now to FIG. 28, shown are steps of one embodiment
of a method of processing a request in the system of
FIGS. 2 and 4. At step 920, the Webserver engine invokes
the Request Router in accordance with the .PHTML MIME
(Multipurpose Internet Mail Extension). At step 922, the
Worker thread as included in the Request Router is initially
forwarded the request for processing. At step 924, a determination
is made as to whether or not this request is serviced
by this node in accordance with the information included in
the configuration and load files. If, at step 924, a determination
is made that the request is not to be serviced by this
node, the request is forwarded to another server node in
accordance with the load and configuration file information.
If, at step 924, a determination is made that this request is to
be serviced by this node, control proceeds to step 926 where
the Worker thread allocates an available parser from the
parser queue to process the incoming request. At step 928,
the incoming request is passed to the designated parser for
processing.

Referring now to FIG. 29, shown is a flowchart of one
embodiment of method steps as may be performed by the
parser. At step 940, the parse driver of the parser parses the
incoming request. In this embodiment, the query request that
is parsed is included as a URL parameter that is processed
by the parse driver. For example, if the query includes syntax
errors, the parse driver will detect and report out such errors.
At step 942, a unique file name is determined in accordance
with the query request. This filename corresponds to the
display results that may be included in the page cache. It
should be noted that this filename is unique for a particular
user query and in accordance with "look and feel" parameters
of the display results. For example, "look and feel"
refers to parameters that describe the displayed results, such
as number of business listings displayed in an HTML page,
the particular starting point of the displayed results with
regard to the resulting data set. For a given resulting data set
corresponding to a user query, on a particular type of user
display window, 15 items may be displayed. The same query
performed by a second user from a different display window
[...] both cases is different even [...]
in forming each of the HTML pages is different. The page
cache may include a different HTML page for each of the 15
and 17 item displays.

A determination is made at step 944 as to whether the
page cache includes the data in the filename determined at
step 942. If a determination is made that the data is included
in the page cache by the existence of the file, control
proceeds to step 946 where the data in the filename is
retrieved from the page cache. Control proceeds to step 956
where the resulting HTML including the data in display
format is delivered to the user's browser.

If a determination is made at step 944 that the data is not
in the page cache, control proceeds to step 948 where a
determination is made as to whether or not there is a PHTML
file in the PHTML execution tree. If a determination is made
that the expanded PHTML representation for this request is
included in the PHTML execution tree, control proceeds to
step 950 where the expanded PHTML representation is
retrieved. Control proceeds to step 954 where portions of the
PHTML file are executed in accordance with the user query
to obtain data to produce the resulting HTML page by
invoking the Query engine for data results. The data results
are returned to the parse driver that creates a resulting
HTML file returned to the user's browser at step 956.
Additionally, it should be noted that the resulting HTML file
may be cached in the Page cache in accordance with
predetermined criteria, as previously described. The resulting
HTML file is communicated directly to the user's
browser. If a determination is made at step 948 that the
PHTML file is not in the PHTML cache, control proceeds to
step 952 where the PHTML file is retrieved from the
PHTML file storage and subsequently expanded. The
expanded PHTML file is stored in the PHTML cache.
Control proceeds to step 954, which is described above.

Referring now to FIG. 30, shown is a flowchart of the
method steps of one embodiment for performing query
engine processing. At step 962, the query engine receives an
incoming request, as forwarded by the parse driver in step
954. At step 964, the data is retrieved for the "normal"
search results as appropriate from the data query cache, or
using an alternate technique. Details of this step are
described in more detail in following paragraphs describing
the use of the data query cache. Generally, "normal" search
results refers to the resulting data set formed by business
listing data associated with a well-defined geographic area.
In addition to "normal" search result data are other search
result data that may not be associated with a single well-defined
geographic area, such as virtual businesses in the
Internet. These other search results that may not be associated
with a single well-defined geographic area are
described in more detail in paragraphs relating to the data
query cache and its use. At step 966, other search data in
addition to the "normal" search data may be retrieved and
integrated into the resulting data set. At step 968, the result
data set is formulated in accordance with the user query
request, such as displaying results in a particular order or

In this particular embodiment, the Standard Industry
Classification (SIC) may be used to indicate various name
categories and synonyms. These various name categories
and synonyms are produced, for example, by the extraction
routines which produce the markup files, as used in this
particular embodiment by the information retrieval software.
Equivalents for [...] in other preferred
embodiments.

It should generally be noted that in the various descriptions
[...] other preferred [...]
stores included in the Front End Server 804. These
techniques may vary with implementation.

The architecture described in FIGS. 2 and 4 is a highly
optimized, distributed, fault tolerant, collaborative architecture.
The primary purpose of this architecture is to support
a high volume of queries which may be performed, for
example, through the Internet. In this particular
embodiment, the databases may include business
information, such as for specific businesses or classifications
of businesses. Additionally, data queries may be performed
based on characteristics of the various businesses, such as
location, name, or category. Furthermore, the architecture
described herein supports a flexible presentation of these
businesses, based on business agreements and service offerings.
The architecture described herein uses various techniques
and combinations to achieve high performance while
maintaining flexibility and scalability.

The architecture as depicted in FIGS. 2 and 4 includes a
set of fully redundant server nodes in which each node is
capable of responding to any search request. Each server
node communicates with all the other nodes, as previously
described, establishing the health and availability of each
server node. Incoming requests are classified by each node,
as routed by the hardware router, using a classification
scheme held in common and by consensus. The nodes agree
to a disjoint partitioning of requests to each of the server
nodes in which one server node will service a set of classes
of requests that no other node will generally service. A
number of complementary techniques, including Subsumption
and Highly Redundant Caching, may be then used to
adapt a particular node to a particular class of requests. Thus,
the latency for request servicing by that node decreases as
additional user queries are performed for each particular
class of requests.

Adaptive techniques, as those performed by the Front End
Server 804, may be most effective when dealing with
repeated requests or queries similar to those previously
performed. Based on the adaptive techniques used herein, an
initial search request may be the most costly in terms of
system resources and search time. Therefore, other techniques
are used in conjunction with the adaptive techniques
to further facilitate performing an optimal query in response
to a user request. For example, common term optimization
(CTO) is one technique which is used that generally takes
advantage of a statistical bias in both submitted queries and
result sets towards particular words or combinations of
words. By anticipating particular word combinations or
precalculated result lists that match, the CTO matches the
initiating search problem.

beginning at a particular point. At step 970, the resulting data 65 In the embodiment described herein, the Front End Server 

set is returned to the parse driver for formatting in a display 804 has a data set domain which includes electronic yellow 

format in an HTML file. pages and advertising requiring a high degree of flexibility 
in the presentation of data. Data is generally presented using the look and feel of business partners in each business listing which may have distinct requirements for presentation. Additionally, new modes of data presentation may be defined on a monthly basis requiring updates to large numbers of data stored in the back office component in the primary and secondary database. To support flexibility, the architecture described uses several techniques that also support performance requirements of the particular data domain in this embodiment and application. Generally, techniques such as the generic object and the generic presentation language may be used to facilitate rapid introduction of new services and additional presentation data in a variety of forms to a user.

Additionally, in the embodiment described in FIGS. 2 and 4, each server may be fully redundant, and there are two additional servers that are designated database servers which have additional supporting software and hardware for facilitating database access. Other embodiments of the invention may include additional configurations of servers and databases in their particular implementation.

While including concepts and techniques described herein, for example, the different databases and packages commercially available which may be used, as known to those skilled in the art, vary with the type of data access using searches to be performed. In this particular embodiment, a relational database structure is used to store and retrieve information in the Front End Server 804. Other embodiments may include additional types of database storage using other commercially available packages or specialized software which facilitate each particular application.

[Generic Objects]

The PHTML files 844 that are provided to the parse driver 858 are scripts that direct the parse driver 858 to perform queries, view the results of queries, and provide information to the browser 824. In a preferred embodiment, the PHTML files 844 are expanded into the PHTML execution trees 846 the first time the parser 866 accesses the PHTML files 844. The parse driver 858 accesses the PHTML execution trees 846 during operation in a manner described in more detail below.

The scripts that are stored in the PHTML files 844 may include commands that are interpreted by the parse driver 858, C++ objects that are executed, blocks of HTML code that are provided by the parse driver 858 to the browser 824, and any other appropriate data and/or executable statements. The PHTML scripts perform operations of objects in a way that is somewhat independent of specific attributes of the objects and thus, as described in more detail below, provide a generic mechanism for displaying and presenting many types of objects. The PHTML scripts include conventional commands to include other files (such as other PHTML files), conditional files/text inclusion commands, switch statements, loop statements, variable assignments, random number generation, string operations, commands to sort and iterate on attributes/fields of an object according to aspects thereof, such as the name, and logging values to files. The specific syntax used for the PHTML scripting commands is implementation-dependent but includes conventional key words (such as "if" and "then") and conventional arrangements of parts of the various types of statements. As described in more detail below, the scripts provided in the PHTML files 844 are used to construct the PHTML execution trees 846 that control the operation of the parse driver 858.

Each business listing may be represented as a document stored in the primary and secondary databases 812, 814. The documents may be manipulated as generic objects. As discussed in more detail below, representing each business listing as a generic object facilitates subsequent handling of the business listings.

Referring to FIG. 5, a table 400 illustrates data storage for a plurality of denormalized objects in the databases 812, 814. The differences between normalized and denormalized data are discussed in more detail elsewhere herein. Generally, denormalized data is optimized for fast performance, perhaps foregoing some storage compaction.

A plurality of rows 402, 404, 406 represent a plurality of denormalized generic objects, each of which corresponds to a business listing. A plurality of columns 412, 414, 416, 418 represent various attributes of the denormalized objects. In a preferred embodiment, the first attribute 412 corresponds to an identifier for the objects 402, 404, 406 and thus identifies a particular listing. Each of the attributes contains a number of fields and contains descriptor information identifying the type, size, and number of fields.

Attributes may be added to the normalized objects, or only to a specific subset thereof. A denormalized representation of any one of the objects 402, 404, 406 contains the same number of attributes as any of the other one of the objects 402, 404, 406. This allows the denormalized objects to be transferred from the primary or secondary databases to the data manager 864 in a string format wherein each object can be identified. Accordingly, if values for a new attribute are added to only a subset of the objects, then the other objects, outside the subset, will contain a null value or some other conventional marker indicating that the particular attribute is not defined (or contains no data) for the objects in question. For example, assume that a new attribute 420 is added. Further assume that the new attribute 420 only contains values for the object 402, but is not defined for the objects 404, 406. In that case, data space for the attribute 420 is still added to the denormalized version of the objects 404, 406, but no value is provided in the attribute 420 for the objects 404, 406.

Referring to FIG. 6, a table 430 represents data stored in the generic object dictionary 860 corresponding to results of a search query provided by the query engine 862 or from the data query cache 850 in the case of a previous search having been performed. In the table 430, it is assumed that a search returns a plurality of objects corresponding to n categories and up to m listings for each of the categories. The annotation o_jk means the object corresponding to the jth category and the kth listing. In the case of the table 430 (and thus the generic object dictionary 860), the objects may be object identifiers. For example, the field 412 may correspond to an object identifier of each of the objects 402, 404, 406. As discussed in more detail below, the parse driver 858 uses the table 430 provided by the generic object dictionary 860 along with the PHTML execution trees 846, to provide specific HTML code from the parse driver 858 to the browser 824 of the user 802.

Referring to FIG. 7, a diagram illustrates a portion 440 of the PHTML execution trees 846. The portion 440 is constructed using the scripts in the PHTML files 844 and consists of a plurality of nodes corresponding to the decision points set forth in the PHTML scripts and a plurality of C++ objects and HTML pages that are executed and/or passed to the browser in response to reaching a node corresponding thereto. Thus, for example, a node 442 can correspond to a PHTML if-then-else statement having two possible outcomes wherein one branch from the node 442 corresponds to one outcome (i.e., the conditional statement evaluates to true) and another branch from the node 442 corresponds to
another outcome (i.e., the conditional statement evaluates to false). Such a structure may be implemented in a conventional manner given a scripting language such as that described above in connection with the PHTML language. That is, implementing such a tree structure using a scripting language is straightforward to one of ordinary skill in the art using conventional techniques in a straightforward manner.

Storing the documents in the databases 812, 814 as generic objects facilitates modifying the documents, or a subset thereof, without modifying the parser 866. For example, if an attribute is added to some of the objects, then it is only necessary to modify the objects (schema and data) that will contain that attribute and to also modify the PHTML files 844 to include new scripting to handle that new attribute. The scripting may include statements to determine if the particular attribute exists for each object. For example, suppose the business listings were in black and white and then color was added to some of the listings. The color attribute could be added to some, but not all, of the objects only in normalized form. Once the new color attribute has been added, the denormalized versions of all of the objects would contain a data space for the attribute, but the objects that do not possess a color attribute will have a null marker. The PHTML files 844 can be modified to test if the color attribute is available in a particular object (e.g., to test for a null value) and to perform particular operations (such as displaying the color) if the attribute exists or, if the attribute does not exist for a particular object, displaying the object in black and white. In this way, the color attribute is added to some of the objects without modifying the parser 866 and without modifying existing objects that do not contain the attribute.

For each query that is presented to the query engine 862, the query engine 862 determines whether the query is found in the data query cache 850 or whether it is necessary to perform a query operation using the Verity software (discussed elsewhere herein) and the term list 836. In either instance, the results of the query are provided by the query engine 862 to the generic object dictionary 860 in a form set forth above in connection with the description of FIG. 6. The parse driver 858 and PHTML execution trees 846 then operate on the generic object dictionary 860 to determine what data is displayed to the user by the browser 824. In some instances, the PHTML execution trees 846 may require the parse driver 858 to obtain additional data from the databases 812, 814 through the data manager 864. For example, in instances where the categories corresponding to the retrieved documents (business listings) are displayed, the PHTML execution trees 846 may cause the parse driver 858 to obtain information from the generic object dictionary 860 that identifies each category and the number of listings corresponding to each category. Then, the portion of the PHTML execution trees 846 may cause the parse driver 858 to use the data manager 864 to access additional information from the databases 812, 814, such as the names of the categories corresponding to the category identifiers provided in the generic object dictionary 860.

Referring to FIG. 8, the parse driver 858 is shown in more detail. An instantiator 452 creates the PHTML files 844 and constructs the PHTML execution trees 846 from the PHTML scripts the first time the PHTML is invoked by the parse driver 858. Instantiation includes reading the PHTML files and constructing trees, such as that shown in FIG. 7, based on the PHTML scripts provided in the PHTML files 844. As discussed above, constructing such trees from a scripting language is generally known in the art.

An interpreter 454 accesses the PHTML execution trees 846 and, based on the information provided therein, provides HTML data to the browser 824 and/or executes a C++ object. The interpreter 454 also accesses a configuration file 456 and a state file 458 which keeps track of the state of various values during traversal of the PHTML execution trees 846. The interpreter 454 also receives other data that is used to traverse the PHTML execution trees 846 and to provide information to the browser 824. The other data may include, for example, data from the data manager 864 and from the generic object dictionary 860. The state data includes information such as the number of iterations (in the case of an iterative loop), the values of various environment and other variables from the PHTML execution trees 846, and the values of other variables and data necessary for performing the operations set forth in the PHTML execution trees 846.

The technique disclosed herein relates to a new data type which abstracts the data interpretation from the data typing by using data schemas. A novel approach is the use of this data typing for rapid service deployment in search engines for advertising services on the Internet. For example, new presentation types may be introduced by an advertiser due to the large number of possible ways to present data to a user. An advertiser may wish to change the information displayed when a user performs a query that results in displaying information regarding the advertiser's business. If there are tens of thousands of advertisers which perform this task on a monthly basis, this implies a very high rate of new presentation types which an online advertising service must be able to accommodate. Use of this generic data type in GTE Superpages™ provides a flexible and efficient approach to incorporate these additional and new presentation types for large numbers of advertisers.

Generally, this technique provides for rapid integration of new data types without requiring recompilation or code changes in source code which uses instances of data that include the additional data types. This provides for the flexible and efficient introduction of data changes.

The generic data typing is optimized for performing multiple data operations by providing a small subset of possible operations or accesses upon any data of the generic data type. Therefore, this small subset of operations which are known may be optimized wherever there is a data access, for example, within the parser. This is in contrast to a non-generic data typing scheme which requires the introduction of a new data type and additional associated access patterns. In a non-generic data typing scheme there is an unlimited and unknown number of access patterns for which optimizations must be performed on an ad-hoc basis as new data types are introduced. Thus, when a new data type is introduced, the possible accesses need to be analyzed and optimized. In addition, the technique described herein provides for denormalized, flat, representations of the objects that facilitate rapid and efficient handling thereof.

The parse driver 858 uses a data schema description to interpret the various data attributes and fields of the generic data objects. Generally, the abstraction of the data interpretation into the data schema description enables different components of the parse driver to operate upon and use generic data objects without having these components require code changes or recompilation due to the introduction of new presentation types. Components which need to know the details of the generic data object, such as the parse driver 858, to perform certain functions, do this on a per component basis by using the data schema description to interpret a generic data object. This insulates code from the introduction of new presentation types which are represented as the generic data objects.
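The schema-driven handling of generic objects described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `Schema`, `GenericObject`, `render_listing` and `NULL_MARKER`, the pipe-delimited string format, and the use of an empty string as the null marker are all assumptions made for illustration. The text states only that denormalized objects travel in a string format and that undefined attributes carry a null value or some other conventional marker.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed null marker: an empty field means "attribute not defined".
NULL_MARKER = ""


@dataclass(frozen=True)
class Schema:
    """Ordered attribute names shared by every denormalized object."""
    attributes: Tuple[str, ...]

    def parse(self, record: str, sep: str = "|") -> "GenericObject":
        # Every denormalized record carries the same number of attributes.
        values = tuple(record.split(sep))
        if len(values) != len(self.attributes):
            raise ValueError("record does not match schema")
        return GenericObject(self, values)


@dataclass(frozen=True)
class GenericObject:
    """A flat record that is interpreted only through its schema."""
    schema: Schema
    values: Tuple[str, ...]

    def get(self, name: str) -> Optional[str]:
        # Unknown attributes and null-marked attributes both read as None.
        if name not in self.schema.attributes:
            return None
        value = self.values[self.schema.attributes.index(name)]
        return None if value == NULL_MARKER else value


def render_listing(obj: GenericObject) -> str:
    """Presentation logic tests for the attribute instead of assuming it."""
    name = obj.get("name")
    color = obj.get("color")
    if color is not None:
        return f'<b style="color:{color}">{name}</b>'
    return f"<b>{name}</b>"  # fall back to a black-and-white listing


schema = Schema(("id", "name", "color"))
with_color = schema.parse("402|Harbor Grill|red")
without_color = schema.parse("404|Main St. Hardware|")

print(render_listing(with_color))
print(render_listing(without_color))
```

In this sketch, introducing the `color` attribute touches only the schema tuple and the presentation function; `GenericObject.get` is unchanged, which mirrors the claim that code using the generic data type is insulated from new presentation types.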
[Query Cache and Request Allocation]

When performing the routing of particular requests, such as data queries, existing systems may perform request routing to a particular server in a distributed computer system without reference to certain available factors, such as an initial partitioning of the entire domain, or an assumption that data queries will be cached in a data query cache and subsequently reused for additional searches. Generally, using the concepts which will be described in paragraphs that follow, the larger the number of queries that are performed when routed to a particular node in accordance with an initial allocation scheme, the quicker subsequent searches on this same particular node may be performed due to the use of the data query cache.

This embodiment relates to concepts that may be included in a variety of applications. One embodiment that includes these is the GTE Super Pages on-line Internet tool that may be used to perform data queries. As an example, consider using this tool to perform an on-line query of all French restaurants within thirty (30) miles of Boston. Generally, GTE Super Pages performs this query returning search results to an on-line user. Concepts which will be described in paragraphs that follow may be generally used and adapted for use in querying any search domain.

A worker thread classifies a request and performs query partitioning in accordance with the URL information. For example, this may include data from the query request such as a specified state, zip code, or area code. The request router 854 receives an incoming request as forwarded by the hardware router. Within the request router 854 of FIG. 4 is generally machine-executable code which embodies the concepts of an adaptive and partitioning scheme with regard to routing requests. Use of this technique allows for high performance search optimizations that leverage and ensure server node adaption to a particular class of requests. The technique of adaptive query partitioning generally increases the performance in terms of high throughput and low latency where queries include Boolean search terms. This search optimization technique may include three components: query partitioning, highly redundant caching, and subsumption.

Query partitioning is the strict classification and routing of a particular query based on its input term characteristics to a node or a particular set of nodes. This information is stored in the various configuration and load files, as described in other sections of this application. Query partitioning ensures that any adaption a node undergoes based on the characteristics of queries that it processes is maintained. Specific nodes may serve specific query partitions. Caching and result set manipulation techniques may then be used on each particular node to bias each particular node to the query partition to which it has been assigned.

Highly redundant caching is generally a technique that trades storage space against time by storing result sets along with subsets of these result sets. The highly redundant caching technique generally relies on the fact that the search time to locate an existing result is generally less than the amount of time which would result in creating the query result from a much larger search space.

One highly effective set manipulation technique, referred to as subsumption, is especially important in the adaption of a particular node. Subsumption is generally the derivation of query results from previous results, which can be either a superset of the requested result or subsets of the requested result. Subsumption is also the recognition of the relationship between queries and the determination of the shortest derivation path to a result set. That derivation may be the composition of several subsets resulting in a superset, or the extraction of a subset from a recognized result set. In subsumption, the presence of an additional conjunctive ("and") search term corresponds to the formation of a subset from the superset described without the additional term. The presence of an additional disjunctive ("or") search term corresponds to the identification and composition of existing subsets each described by one of the disjunctive clauses.

Consider the following example of the use of the data query cache and subsequent searches which use a subset of the data stored in the cache. For example, suppose the first request results in a query of all of the restaurants within thirty (30) miles of Boston. This query data is placed in the data query cache. A second request results in a query of all the seafood restaurants within thirty (30) miles of Boston. The second request is routed to the same node as the first request in accordance with loading configuration files, for example, as shown on FIG. 4. The second query is performed quickly by using the data query cache information and searching the cached data indicating restaurants within thirty (30) miles of Boston for the subset of this first search data which indicates seafood restaurants. Subsequently, this second request query data which indicated all the seafood restaurants within thirty (30) miles of Boston is also stored as a separate data set within the data query cache.

It should generally be noted that the data included in the data query cache is placed in nonvolatile storage such that if the node were to become unavailable, data from the data cache may be fully restored once the node resumes service.

The composition query also uses the data in the data query cache. A composition query may generally be referred to as one which is a composition of several queries, for example, when using several conjunctive search terms. For example, a request of all the French restaurants in Massachusetts, Texas and California is a composition query that may reuse existing cached data from previous queries stored individually regarding restaurants in Massachusetts, Texas and California. A composition query is generally determined by the Parse Driver, and the request router decides to which server node 808-810 within the Front End Server the composition query is sent for processing in accordance with domain weights of the configuration file. Consider the following Configuration File information based upon the previous composition query:

[Configuration File table omitted]

The Request Router may route the composition request to either server 1 or 2. If the request is routed to server 1, data may be cached regarding Massachusetts and Texas for reuse and a new query may be performed for the California information. If the request is routed to server 2, data may be cached for reuse regarding California and new queries performed for the Massachusetts and Texas information. The Request Router, based on the weights, sends the request to server 2 since the cost associated with performing the Massachusetts and Texas queries is less than the cost of performing the California query.

In the above caching scheme, a particular domain is associated with a particular server node upon which data query caching is performed for designated domains. The
domain and server weights reflect the cost associated with processing a request on each node using the data query cache. Accordingly, routing a request in accordance with these weights results in faster subsequent query times for those requests.

Reallocation of the requests when a server is unavailable is performed with a bias toward the initial allocation scheme as indicated by the Configuration File. There is an assumption that reallocation is on a transient basis and that the initial allocation scheme is the one to be maintained. Consider the following server nodes (M1-M4) and the domains initially allocated to each node as indicated below:
Domains D1 and D2 allocated to node M1.
Domains D3 and D4 allocated to node M2.
Domains D5 and D6 allocated to node M3.
Domains D7 and D8 allocated to node M4.
At a first time, node M1 becomes unavailable and the routers reallocate Domain D1 to node M2 and D2 to node M3. At a second time, node M2 also becomes unavailable. Domains D1 and D3 are reallocated to node M3 in addition to domains D5 and D6. Domain D4 is reallocated to node M4 in addition to domains D7 and D8. At a third time, node M1 is restored and node M2 is still unavailable. Domains D1 and D2 are reallocated to M1 in addition to Domain D3. Domains D5, D6 and D4 are allocated to node M3. Domains D7 and D8 are allocated to node M4. There is a bias toward restoring the initial allocation scheme when a node becomes available. This bias contributes to faster subsequent query times upon re-entry of a server node due to the use of the data query cache, and routing of subsequent requests to the particular nodes in accordance with this bias.

In paragraphs that follow, described are data query caching techniques as may be used in conjunction with the foregoing techniques.

Referring now to FIG. 33, shown is an example embodiment of a flowchart of method steps for performing a data query. At step 200, a determination is made as to whether a data set in the data query cache corresponds to the current query being made. If so, control proceeds to step 202 where this data is retrieved and used by the query engine in formulating the query results that are displayed to the user. At this point, the processing stops at step 216.

If a determination is made at step 200 that no data set in the data query cache corresponds to the current query being made, control proceeds to step 204 where parents of the current query are determined by dropping one of the terms. For example, if the query being made is for "Massachusetts AND RESTAURANTS AND FLOWERSHOPS", each of the three terms is sequentially dropped to form all combinations of two possible terms. In this instance, the set of parents is the following:
Massachusetts AND RESTAURANTS
Massachusetts AND FLOWERSHOPS
RESTAURANTS AND FLOWERSHOPS
It should be generally noted that in this embodiment, a search is made for only the parent terms. Similarly, other embodiments may go further in searching for results in the data query cache by also forming grandparent terms, as by dropping two terms. This process can be repeated for any number of terms being dropped and subsequently determining if any data sets in the data query cache correspond to the resulting terms.

At step 205, a determination is made as to whether data results in the data query cache correspond to any of the parent terms. If not, control proceeds to step 212 where a closest ancestor may be used as a basis for starting to form the resulting data set. In one embodiment, preprocessing insures that ancestor-based geography exists. In one implementation, that ancestor is a Verity term list associated with a particular state. This implementation uses API calls to retrieve the data identifiers corresponding to the resulting data to be included in the query results.

If, at step 205, it is determined that there are one or more data sets in the data query cache that correspond to one or more of the parent terms, control proceeds to step 206 where a cost is associated with each parent. One embodiment associates a cost with each parent term in accordance with the number of listings of each parent term. This may also be normalized and used in a percentage form by dividing the number of listings in the parent domain by the total number of listings. This percentage represents the probability of a business listing belonging to the parent domain. [...] the minimum derivation sequence is applied to produce the resulting data set. Generally, the minimum cost derivation [...]

It should generally be noted that in other embodiments in which other extended parentage thresholds are used, such as grandparents, the determination of the start data set in step [...] is the data set that is closest in terms of parentage [...] the primary ranking basis, with the number of listings being secondary in determining ranking.

Referring now to FIG. 34, shown is a diagram of one example used in step 210 for determining and applying the best derivation sequence. In this example, the query is for Massachusetts AND RESTAURANTS AND FLOWERSHOPS. As represented in state 230, it has been determined that [...] extended to grandparents, and Massachusetts has been determined to be the first ranking data set in terms of parentage and number of listings in the data set. At this point, control proceeds to one of two states, 232 representing "Massachusetts AND RESTAURANTS", or 234 representing "Massachusetts AND FLOWERSHOPS". The state to which control proceeds is based generally on choosing the minimum associated cost at each step. In this instance, the number of elements in the data sets "FLOWERSHOPS" (state 234) and "RESTAURANTS" (state 232) are compared. If the number of elements in the data set FLOWERSHOPS is less than the number of elements in the data set RESTAURANTS, control proceeds to the state 234 where each business in the data set FLOWERSHOPS is examined to determine if it is also in Massachusetts. The resulting data set forms the set of all business listings in Massachusetts AND FLOWERSHOPS. In contrast, if the number of elements in the data set RESTAURANTS is less than FLOWERSHOPS, state 232 is entered and similar searching of the data set is performed. From either state 232 or 234, control proceeds to state 236 where searching of the data set elements is performed to produce the final resulting data set representing "Massachusetts AND RESTAURANTS AND FLOWERSHOPS". Generally, the approach just described is to advance to the next state which has the minimum cost associated until the final resulting data set is determined.

It should also be noted that some of the determination of data sets as used in performing queries may be done as
preprocessing to partition the data sets. For example, in one
embodiment, the data is partitioned by states. The adaptive
techniques as described with regard to the GTE Superpages
application described herein include partitioning the data
sets based on geography, particularly within each state. In
this instance, particular server nodes are designated as
primary query servers based on geographic location by state.
Additionally, as part of this partitioning of requests, the data
query caches and term lists of identifiers are also partitioned
according to state. In this embodiment, this partitioning is
done as a preprocessing step prior to servicing a request in
that the identifiers are formed and placed on each dedicated
server node. Similarly, other data partitioning may also be
performed as part of a preprocessing step. Generally, this
partitioning may be determined based on expected data
queries and data sets formed accordingly, for example, by
examining log files with recorded data query search histories
to determine frequently searched categories or combinations
of categories.

A query request, as made by a user, is generally the
combination of boolean operators and search terms. In this
embodiment, the general form of a term in a query request
is:

key=value

in which the "key" represents some category or search term,
such as STATE. "Value" represents the value which this key
has in this particular query. With regard to the previous
example, "S=MA" may represent the query term STATE=MA.
Key-value pairs or terms may be joined by the logical
boolean AND operation, represented, for example, as "&".
The logical boolean OR operation may also be represented,
for example, by another symbolic operator such as ",". For
example, when looking for either cities of ACTON or
BOSTON, this may be represented as:

T=ACTON,BOSTON

The number and types of "keys" varies with embodiment.
For example, in this embodiment, keys include: (T) City,
(B) Business Listing, (S) State, (R) Sort Order, (LT)
Latitude, (LO) Longitude, and (A) Area Code. In this
application, for example, LT and LO may be used to
calculate data sets relating to proximity searches, such as
restaurants within thirty (30) miles of Boston.

The Data Query Cache 850, in this embodiment, generally
includes a "hot" and "cold" cache. In this embodiment, the
caching technique implemented is the LRU (Least Recently
Used) policy by which elements of the cache are selected for
replacement in accordance with time from last use. These
and other policies are generally known to those skilled in the
art. Generally, the "hot" cache may include the most recently
used items and the cold cache the remaining items. In this
embodiment, each of the data query caches and other
caching elements as depicted in FIG. 2 may be fast memory
access devices, as known to those skilled in the art, used
generally for caching.

It should generally be noted that in this particular
embodiment, the "hot" cache is implemented as storing the
data in random access memory. This may be distinguished
from the storage medium associated with the "cold" cache
representing those items which are determined, in accordance
with caching policies such as the LRU, to be least
likely to be accessed when compared with the items in the
hot cache which are determined to be more likely to be
accessed.

In this embodiment, a double ended queue structure is
used to store cached objects, but other data structures known
to those skilled in the art may be used in accordance with
each implementation.

Data sets that are stored in the data query cache and page
cache each correspond to a particular search query. In other
words, a mapping technique may be used to map a particular
query to corresponding data as stored in the data query cache
and the page cache. Generally, this mapping uniquely maps
a data query to a name referring to the data set of the data
query. In this embodiment, this allows quick access of the
data set associated with a particular query and quick
determination if such a data set exists, for example, in the data
query cache.

Referring now to FIG. 35, shown is a flowchart of an
embodiment of the steps for forming a name associated with
a data set, as may be stored in the data query cache or page
cache. At step 240, a subset of query terms is determined
such that a string representing a particular query is uniquely
mapped to a name corresponding to a data set. In this
embodiment, the subset of keys that are used in mapping a
string corresponding to a query to a name of a data set
include:

Proximity, City, State, Street, Zip, Category, Category
Identifier, Business name, Area code, Phone number,
Keywords, and National Account.

Generally, "Proximity" represents the proximity in physical
distance to/from a geographic entity, such as a city.
"City", "State", "Street", "Zip", "Area Code", "Phone
Number", and "Business Name" represent what the keys
semantically describe as pertaining to a business listing.
"Category" represents a classification as associated with
each business, such as representing a type of business
service. "Category Identifier" is an integer identifier
representing a category id. "Keywords" indicate an ordering
priority for the resulting data set. "National Account"
represents a business or service level parent-child relationship
where the national account indicates the parent. An example
is a parent-child relationship between a parent corporation
and its franchises.

At step 244, a query string corresponding to a particular
user query is formed using the original string as formed, for
example, by the Parser of FIG. 2. The query string includes
only those terms which are included in the subset as
identified in step 240. If the original string does not include an
item that is in the subset, for example, since the user query
does not include the item as a search term, that item is
omitted in forming the query string corresponding to the
data set. At step 248, this query string is used to determine
if a data set is located in the data query cache that
corresponds to the current user query request. In this
embodiment, the data sets each correspond to a filename.
Thus, a lookup as to whether a data set corresponding to a
particular user query exists may be determined by performing
a directory lookup, for example, using file system
services as may be included in an operating system upon a
device which serves as a fast memory access or other
caching device.

It should be noted that this technique may be used
generally within the Superpages Front End Server and
BackOffice to form unique names that correspond to particular
search terms. For example, one embodiment may include
services for operating upon the original query string as
formed by the Parser to produce parents and grandparents of
the terms included in a query when performing the method
steps of FIGS. 33 and 34 if there is no exact data set match
in the data query cache. This may provide the advantage of
insulating other code, such as in data encapsulation, from
knowing the internal structure of the query string. Generally,
as known to those skilled in the art, this is a common
programming technique to insulate code portions from
changes in data types and structures in order to minimize, for
example, the amount of recompilation when a new data type
is introduced or an existing data type is modified. Other
techniques, such as hashing, may be used to generate a
unique identifier for the input string, as known to those
skilled in the art.
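
As a rough illustration of the naming services just described, the following Python sketch forms a cache name from a parsed query string and produces parent and grandparent names by dropping terms. It is a minimal sketch under stated assumptions, not the embodiment's code: the function names, the "C" key for Category, the "&" separator in the resulting name, and the sorting used to make names order-independent are all hypothetical choices made for the example.

```python
from itertools import combinations

# Assumed subset of keys that take part in naming. "S" (State) and "T" (City)
# appear in the text; "C" for Category is a hypothetical abbreviation.
NAME_KEYS = {"S", "T", "C"}


def parse_query(query):
    """Split a string such as 'S=MA&C=RESTAURANTS' into (key, value) pairs."""
    return [tuple(term.split("=", 1)) for term in query.split("&") if term]


def cache_name(pairs):
    """Keep only the naming keys and join them in sorted order.

    Sorting is an assumption of this sketch so that the same set of terms
    always yields the same name, whatever order the user typed them in.
    """
    kept = sorted((k, v) for k, v in pairs if k in NAME_KEYS)
    return "&".join(f"{k}={v}" for k, v in kept)


def ancestors(pairs, dropped=1):
    """Names formed by dropping `dropped` terms: 1 = parents, 2 = grandparents."""
    kept = [(k, v) for k, v in pairs if k in NAME_KEYS]
    size = len(kept) - dropped
    if size < 1:
        return []
    return [cache_name(combo) for combo in combinations(kept, size)]


pairs = parse_query("S=MA&C=RESTAURANTS&C=FLOWERSHOPS&R=NAME")
print(cache_name(pairs))   # the sort-order term R is not part of the name
print(ancestors(pairs))    # three parent names, one per dropped term
```

Callers of such helpers need only pass the original query string around; the key subset and the naming rule stay in one place, which is the insulation benefit noted above.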

It should be generally noted that a similar mapping 
technique is used in forming a Page Cache name. The 
technique used is as described for forming the Query Cache 
filename with additional qualifying terms in accordance with 
the "look and feel", such as display features, used to produce 
the Page Cache name. For example, if the displayed result- 
ing HTML page includes 15 listings/page, the Page Cache 
name includes a parameter in forming the name uniquely 
identifying the filename including the result set for a query 
in this particular display format. 

Generally, in this embodiment, the data query cache
includes cache objects in which each cache object
corresponds to a particular cached query resulting data set.
Referring now to FIG. 36, shown is a block diagram of one
embodiment of a data set as stored in the data query cache.
Generally, each data set 250 includes header information
252 and information corresponding to one or more business
listings. Generally, header information may include
information describing the data query set, such as the number of
business listings in the data set. Other types of information
may be included in accordance with each particular
application and implementation.

Each business listing 254 generally includes information
that describes the business listing. More particularly, this
information includes data that is cached as needed by other
components in the Front End Server, for example, in
performing various searches, data retrieval, and other
operations upon data in accordance with functionality provided by
the embodiment. In this instance, the following types of
fields of information are stored for each business listing 254:

1) number of categories associated with this business listing

2) latitude

3) longitude

4) business name

5) city

6) state

7) list of categories associated with this business listing

8) database key or identifier used as an index into the
databases

9) relevance information

10) advertiser priority

In the above fields, relevance information is Verity-specific
information as it relates to the query. For example,
this generally represents the frequency of words or terms in
a document. The advertiser priority indicates a service level
that may be used in presenting business listings, for
example, in a particular order to a user. For example, if a first
advertiser purchases "gold" level advertising services, and a
second advertiser purchases "silver" level advertising
services, when a user requests only 15 listings to be
displayed, the "gold" level advertisements may be displayed
prior to the other advertisements by other advertisers, such
as the "silver" level service purchaser. Thus, a higher level
of service may guarantee an advertisement be placed earlier
in the displayed results.

The technique used to store the data in the data cache from
memory includes object serialization and deserialization
techniques, as known to those of ordinary skill in the art.
These techniques transform an internal storage format, as
may be stored in random access memory, to a format
suitable for persistent storage in a file system, as in the data
query cache. The complementary operation is also
performed from persistent storage to the in-memory copy. For
each of the above-named fields, object serialization, i.e.,
from memory to persistent storage device in cache, is
performed by storing the data type, its length, and the data
itself. It should be noted that the length may not be needed
for each data field, for example, in fixed length data types.
The complementary operation of object deserialization is
generally performed by reading the fields in the same order
as written to the cache.

In this embodiment, other caches may have other storage
techniques. For example, the Page Cache may be
implemented as HTML files in a file structure located on a disk or
other storage device. The PHTML execution tree may be
implemented as an in-memory linked list or other abstract
data structure representation of the C++ objects.

It should be noted that in this particular embodiment, the
data query cache may include different types of cached
geographical data as may be used in performing different
data queries. For example, the type of data cached described
in the prior paragraphs is the "normal" business listing data
as associated with a well-defined geographic area. Other
businesses, for example, such as a florist or an airline, may
not be associated with a single well-defined geographic
location. A business may not have any geographic bounds,
such as if it is an Internet business with a virtual storefront
accessible on the Internet. Also, other businesses may be
located in a particular well-defined geographic area, such as
an airline with a physical presence in a particular city, but the
service area which corresponds to the service offered does
not correspond to the location of the business itself. To
include businesses with these particularities, in addition to
the "normal" business listing just described in which the
geographic business location and service areas correspond,
the concepts of multi-city and total-city placements have
been included in this embodiment.

Generally, multi-city placement may be described as
representing a business' service area in multiple cities when
data queries are performed. An example may be a plumbing
service located in three (3) cities with service areas in ten
(10) cities. The total-city placement may generally be
described as representing a business' service area in all cities
when searches are performed. An airline is generally an
example of this which services all major U.S. cities.
Generally, in this embodiment, the total-city and multi-city
search results are cached separately from the "normal" query
results, but are composited with the normal search results
prior to retrieving the data from the database.

It should generally be noted that in this embodiment, the
total and multi-city query results are retrievable independent
of the "normal" search results. However, the storage format
for this information, in this embodiment, may be as
described for "normal" query results. Generally, other
embodiments may use a different format for storage than the
"normal" search results, for example, if other information is
deemed to be important in accordance with each
implementation.

The technique of performing the total and multi-city
query search optimization in conjunction with the normal
query caching will be described in paragraphs relating to
FIGS. 37 and 38 that follow.

Referring now to FIGS. 37 and 38, shown is a flowchart
of an embodiment of a method for integrating total-city and
multi-city cache results into "normal" cached search results.
At step 260, a total-city cache name corresponding to the
data query is formed. In one embodiment, the total-city
cache name is formed by starting with the string "SCOPE=T"
to identify a total-city name. Additionally, the following
information is extracted from the original query string, as
formed by the parser:

category, category id, business name, street address,
keywords, longitude, latitude

These key-value pairs are extracted from the original query
string and appended to the "SCOPE=T" to form the
total-city cache name. In one embodiment, these functions of
extracting the information from the original query string and
forming the total-city cache name may be performed by the
same software as forming the name for the data query cache
"normal" query name, such as by API calls to the same
routines with parameters, as known to those of ordinary skill
in the art of programming.

At step 262, it is determined if the total-city query data set
corresponding to the total-city cache name for the current
query exists. If it does, control proceeds to step 264 where
the total-city data set cached item is moved to the hot cache,
if not already in the hot cache. A reference to this data set
is saved for later retrieval in other processing steps. If at step
262, a determination is made that the total-city query cached
data set corresponding to the total-city cache name does not
exist, control proceeds to step 266 where a search is
performed for the total-city query. At step 268, the search
results are cached, as in the "hot" cache. A reference to these
search results is stored for use in later processing steps.
Generally, an empty or null search result stored in cache
may be just as important for performance as a non-null
search result that is cached.
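
The lookup in steps 262 through 268 might be sketched in Python as follows. This is an illustrative sketch only: the hot and cold caches are modeled as plain dictionaries, `search` stands in for whatever database search the embodiment performs, and the "&" separator and the key abbreviations passed in are assumptions of the example.

```python
def total_city_name(pairs, keys):
    """Start with SCOPE=T and append the retained key=value pairs."""
    kept = [f"{k}={v}" for k, v in pairs if k in keys]
    return "&".join(["SCOPE=T"] + kept)


def fetch(name, hot, cold, search):
    """Return the cached data set for `name`, searching only on a miss."""
    if name in hot:
        return hot[name]
    if name in cold:
        hot[name] = cold.pop(name)   # move the cached item to the hot cache
        return hot[name]
    results = search(name)           # hypothetical database search callback
    hot[name] = results              # cached even when the result is empty
    return results
```

Because membership is tested with `in` rather than by truthiness, an empty result list counts as a cache hit, so a query with no total-city listings does not trigger a new search each time it is repeated.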

Control proceeds to step 270 of FIG. 38 where a multi- 
city cache name is constructed representing the multi-city 
cache corresponding to the current data query. In one 
embodiment, this multi-city cache name may be constructed 
by forming a string using the same fields extracted from the 
original data query string as formed by the parser in con- 
junction with forming the total-city name. Similar to form- 
ing the data query name for the "normal" cached search 
results, the string corresponding to the cached data set for a 
given query uniquely identifies the data set. In forming the
multi-city cache name, appended to the concatenated key- 
value pairs is a string of "SCOPE=M" rather than the string 
"SCOPE=T", as with the total-city cache name. 

At step 272, a determination is made as to whether there
is multi-city cached data corresponding to the current
multi-city cache name. If, at step 272, a determination is made that
such a data set exists in the multi-city cache, control
proceeds to step 274 where the data is moved to the
"hot" cache, if not already located there. Additionally, a
reference to this location in the "hot" cache is saved for use
in later processing steps. If, at step 272, a determination is
made that such a data set does not exist in the multi-city
cache, control proceeds to step 276 where a search of the
database is performed. The query results, if any, are cached
in the "hot" cache with a reference to the results saved for use
in later processing steps.

At step 280, the total-city and multi-city data cache results 
are integrated with the "normal" query results. After the
"normal" query is performed, but before sorting the search 
results, the total-city-cached results, if any, may be com- 
bined with the "normal" query results. If there are no 
total-city cached results, the multi-city results may be 
included, if any. 

The combined search results are then sorted such that any 
redundant listings are removed. Any additional processing is 
performed, as in accordance with the user query, for 
example, as producing the listings which begin with "B", or 
only listing the top ranked fifteen (15) listings as ranked in 
accordance with other user specified criteria. 
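
One way to picture the integration at step 280 and the subsequent sorting is the short Python sketch below. It assumes each listing is a dictionary with hypothetical `id`, `name`, and `priority` fields and uses an arbitrary sort key (higher priority first, then name); the actual ranking criteria are whatever the user query specifies.

```python
def integrate(normal, total_city, multi_city, limit=15):
    """Combine result sets, drop redundant listings, sort, and truncate."""
    # Total-city results are combined if present; otherwise multi-city ones.
    extra = total_city if total_city else (multi_city or [])
    combined = {}
    for listing in list(normal) + list(extra):
        # The first occurrence of each listing id wins; later duplicates drop.
        combined.setdefault(listing["id"], listing)
    ranked = sorted(combined.values(),
                    key=lambda item: (-item["priority"], item["name"]))
    return ranked[:limit]
```

De-duplicating by identifier before sorting keeps the sort input small; an implementation could equally remove adjacent duplicates after sorting, as the wording of the text suggests.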
In all the caches, a garbage collection technique may be
included to remove or delete cached objects that have been 
determined to be "old" in accordance with predetermined 
criteria. For example, in one embodiment using the LRU 
caching scheme, whenever the amount of free cache space 
falls below a threshold level, the garbage collection routine 
is invoked. The threshold level includes parameters relating 
to a predetermined number of cache objects and the accu- 
mulated size of the objects in the cache. In this embodiment, 
although there may be multiple conceptual caches, such as 
the "normal" data query cache, the multi-city cache, and the 
total-city cache, the cached results may physically reside in 
the same "hot" and "cold" caching devices. However, in this 
embodiment, the different types of caching results may be 
accessed independent of the other caching results. Other 
embodiments may have other organizations of the caches in 
accordance with other implementation and associated data 
requirements. 
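
The hot and cold caches and the threshold-triggered garbage collection can be sketched together in Python. This is a simplified sketch under stated assumptions: the class name and its parameters are hypothetical, the threshold is expressed only as a maximum object count (the accumulated-size parameter mentioned above is omitted), and an `OrderedDict` stands in for the double ended queue of the embodiment.

```python
from collections import OrderedDict


class HotColdCache:
    """Two-tier LRU cache: recently used items are hot, the rest are cold."""

    def __init__(self, hot_capacity, max_objects):
        self.hot = OrderedDict()    # most recently used item is last
        self.cold = OrderedDict()   # oldest demoted item is first
        self.hot_capacity = hot_capacity
        self.max_objects = max_objects

    def get(self, name):
        if name in self.hot:
            self.hot.move_to_end(name)      # refresh recency
            return self.hot[name]
        if name in self.cold:
            value = self.cold.pop(name)
            self.put(name, value)           # promote to the hot cache
            return value
        return None                         # miss

    def put(self, name, value):
        self.cold.pop(name, None)           # avoid a stale cold copy
        self.hot[name] = value
        self.hot.move_to_end(name)
        while len(self.hot) > self.hot_capacity:
            old_name, old_value = self.hot.popitem(last=False)
            self.cold[old_name] = old_value  # demote least recently used
        self.collect_garbage()

    def collect_garbage(self):
        """Delete the oldest cold objects once the object count is exceeded."""
        while len(self.hot) + len(self.cold) > self.max_objects and self.cold:
            self.cold.popitem(last=False)
```

Since names such as "SCOPE=T..." and "SCOPE=M..." are distinct keys, the normal, multi-city, and total-city results can share one such structure while still being looked up independently.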
[Information Retrieval] 

A variety of information retrieval techniques may be used 
to retrieve records stored in the Primary Database 812. 
Further details of the query engine 862 are presented in 
schematic format in FIG. 39. When the parse driver 858 of 
the parser 866 of one of the servers 808 delivers a parsed 
instruction to the query engine 862, the query engine 862 
may, in an embodiment of the invention, include information 
retrieval software 908 to retrieve records from the Primary 
Database 812 that correspond to the user's query. The query 
engine 862 may include more than one form of information 
retrieval software. For example, the query engine, in addition
to including the information retrieval software 908 that is
to be used to obtain listings in response to user queries, may 
further include banner ad retrieval software 909 for retriev- 
ing advertisements that relate to the user's query. 

In an embodiment of the invention, the information 
retrieval software 908 may include functionality of software 
such as the Information Server Version 3.6 software com- 
mercially available from a company known as Verity. Other 
commercial packages of information retrieval software are 
available, and the techniques described herein could also be 
employed using proprietary software coded by the user. In 
an embodiment, the information retrieval software 908 
includes the Information Server Version 3.6 software and 
additional extensions provided by the host of the GTE 
Superpages system. 

Referring to FIG. 40, steps by which the information 
retrieval software 908 obtains results are set forth in a flow 
chart 83. The information retrieval software 908 may at a 
step 82 access markup language files 906, as depicted in 
FIG. 25, which are produced by the extraction routines 902 
from the normalized data 900. In an embodiment, the 
markup language files consist of business listings that are 
stored in the Primary Database 812. The information 
retrieval software 908 may then, at a step 84 produce term 
lists 836 that are further used by the information retrieval 
software 908 to handle queries that are delivered to the query 
engine 862. The term lists 836 may consist of a linked list 
for each term that appears in one of the business listings, 
with the elements of the linked list including a document 
identifier for the business listing and certain statistics regard- 
ing the frequency of occurrence of the particular term in 
each document and in the document set as a whole. The 
banner ad retrieval software 909 may similarly generate and 
use banner ad term lists 837 that are further used by the 
banner ad retrieval software 909 to handle generation of 
appropriate banner ads. Next, at a step 90, the term lists, 
which in an embodiment are generated using Verity 
software, may be expanded at a step 86 to include synonyms
for the terms appearing in the business listing. For example,
if the term "diner" appears in a business listing, then the term
"restaurant" might be assigned to the file for that business
listing as stored in the Primary Database 812. The expansion
of the listings to include synonyms of the words included in
the listings may be accomplished by execution of PHTML
scripts or other programming techniques. The expansion
may establish a hierarchical structure; for example, the term
"restaurant" may be stored in a tree that includes the
subcategory of "ethnic restaurant," which may further
include the sub-category "greek restaurant." PHTML scripts
may be provided to establish the tree structure and to operate
on the tree structure to retrieve results that will be provided
to the user. The steps 82, 84 and 86 may be accomplished at
initialization of the system, thus establishing and expanding
the term lists 836, 837 for later use.

Once the system is initialized, the system may operate to
obtain results that are to be displayed to the user. The steps
for obtaining results may be seen in a flow chart 88
displayed in FIG. 41. Referring to FIG. 41, the parse driver 858
may at a step 20 parse a user query and deliver the parsed
query in suitable form for handling by the query engine 862.
The query engine may include the information retrieval
software 908. At a step 22, the query engine 862 may operate
the information retrieval software 908 to take the parsed user
request and expand the query, turning the user request into
a detailed query. Next, at a step 24, the information retrieval
software may operate on the expanded term lists 836 by
identifying documents associated with the terms identified in
the expanded query. In an embodiment, the term lists 836 are
the business listings described in connection with steps 82,
84 and 86 above, expanded to include synonyms and terms
that are determined to be related to the words in the business
listing. Identification of documents may be accomplished by
a variety of information retrieval techniques. Documents
may also be associated with queries by sorted relevancy
ranking, clustering (automated grouping of related
documents), automated document summarization (creation
of content abstracts, not simply the first few sentences of the
document) and query-by-example (turning an individual
document into a query in order to retrieve "more documents
like this"). These functions may be accomplished by
software techniques, such as having a table of pointers having
as an argument a tokenized version of each possible term
from the expanded user query from the step 22. The table of
pointers may point to the location of a term list 836 for each
such term. The term list may be a linked list of documents
that include the term. The linked list may include
information about each document, such as the number of
occurrences of the term in the document, the inverse frequency of
the term in the entire set of documents, the association of the
document with other documents, the association of the
document with categories, and the like.

A variety of different techniques can be used to index
documents for information retrieval. In an embodiment, an
indexing architecture such as that provided by Verity allows
for incremental indexing, so that only new, updated or
deleted documents require changes, avoiding the need for a
complete re-index each time a document changes. Online
identifiers may be provided, so that searches can continue
while the identifiers are modified. This function is also
provided by the Verity software.

At a step 28 a variety of weighting algorithms can be used
to rank documents identified in the step 24 according to the
information stored in the term lists 836. For example, a
simple weighting algorithm might take a single term query,
such as a category of information, and rank each document
in a term list 836 in numerical order according to the product
of the term frequency (the number of times a term appears
in the document) and the inverse document frequency (the
inverse of the number of times the term appears in the entire
document set).
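
The simple weighting just described can be sketched in a few lines of Python. The sketch builds term lists as plain lists of (document identifier, term frequency) pairs and scores a single-term query exactly as the text defines it, with the inverse document frequency taken as one over the term's total occurrences in the set. The function names and the whitespace tokenization are assumptions of the example; commercial engines such as the one named above typically use more elaborate weighting, often with a logarithmic inverse document frequency.

```python
from collections import Counter, defaultdict


def build_term_lists(docs):
    """Map each term to [(doc_id, term_frequency)] plus set-wide totals."""
    term_lists = defaultdict(list)
    totals = Counter()               # term -> occurrences in the whole set
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        for term, tf in counts.items():
            term_lists[term].append((doc_id, tf))
            totals[term] += tf
    return term_lists, totals


def rank(term, term_lists, totals):
    """Rank documents for a single-term query by tf * (1 / total count)."""
    if term not in term_lists:
        return []
    idf = 1.0 / totals[term]
    scored = [(doc_id, tf * idf) for doc_id, tf in term_lists[term]]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For a single-term query the inverse factor is the same for every document, so the ordering is driven by term frequency alone; the factor starts to matter once scores for several terms are combined.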

Once the documents are ranked, at a step 30 a list of the 
ranked documents may be further processed by the infor- 
mation retrieval software to provide a results page. In 
particular, at the step 30, the information retrieval software 
908 may determine categories into which the retrieved 
documents fall. In an embodiment, the categories are yellow 
pages categories, which have been previously assigned to 
the documents, which are business listings, prior to entry of 
the business listings in the Primary Database 812. Thus, at 
the step 30, the information retrieval software 908 deter- 
mines what categories are associated with the business 
listings retrieved by the ranking at the step 28. Next, at a step 
98, the information retrieval software 908 may compare the 
categories identified at the step 30 to the terms in the user 
query. If categories are present that do not include any of the 
terms in the user query, then, at a step 92, such categories 
may be discarded. Thus, the user will not retrieve categories 
that are unrelated to the user query. Such categories might 
otherwise appear, for example, if the information retrieval 
software 908 retrieves a business listing that is associated 
with two unrelated categories, only one of which is relevant 
to the user query. For example, a query for a restaurant might
retrieve a listing for "Joe's restaurant and bowling alley."
The information retrieval software 908 might then retrieve
the categories "restaurants" and "bowling" that would have
been associated with that listing. The "bowling" category
would be discarded, because the user query for a restaurant
is unrelated to the "bowling" category. The term comparison
may use an expanded version of the terms in the query and 
in the categories. Thus, a category would not be discarded if 
it includes a synonym of a query term, even if the category 
does not include an exact term match. 

Once the non-matching categories are discarded at the 
step 92, the information retrieval software may, at a step 94, 
determine whether there are any remaining categories. If 
not, then control proceeds to a step 96, at which the user is 
informed that there are no matching categories. The user 
may then be returned to the query screen. If, at the step 94, 
at least one category remains, then, at a step 98, the 
information retrieval software determines whether there is 
more than one category. If not, then at a step 100 the system 
may display the actual business listings that appear in that 
one category to the user. If at the step 98 it is determined that 
more than one category remains, then at a step 102 the 
system may display a results page that consists of a list of the 
remaining categories. The results page may further include 
an indication of the number of listings that are associated 
with each category. 
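
The category filtering and the three-way choice of results page described in the two preceding paragraphs might look like the following Python sketch. The synonym table, the function names, and the tuple returned by `results_page` are hypothetical; the sketch only illustrates comparing expanded terms and then branching on how many categories remain.

```python
def expand(terms, synonyms):
    """Return the terms plus any synonyms listed for them."""
    expanded = set()
    for term in terms:
        term = term.lower()
        expanded.add(term)
        expanded |= synonyms.get(term, set())
    return expanded


def filter_categories(categories, query_terms, synonyms):
    """Keep categories that share at least one expanded term with the query."""
    wanted = expand(query_terms, synonyms)
    return [c for c in categories if expand(c.split(), synonyms) & wanted]


def results_page(categories, listings_by_category):
    """Choose what to display based on how many categories remain."""
    if not categories:
        return ("no_match", [])
    if len(categories) == 1:
        return ("listings", listings_by_category.get(categories[0], []))
    return ("categories",
            [(c, len(listings_by_category.get(c, []))) for c in categories])
```

With a synonym table mapping "restaurants" to "restaurant", a query for "restaurant" keeps the "restaurants" category and discards "bowling", matching the example of Joe's restaurant and bowling alley.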

The document identifiers established for information 
retrieval software 908 may maintain pointers to other docu- 
ments or to sources of the documents, such as URLs or file 
names. Thus, the identifiers may be stored apart from the 
documents allowing separate, non-invasive use of the 
identifiers, while maintaining the integrity of the data. 
[Common Term Optimization (CTO)] 

In an embodiment of the information retrieval system 
disclosed herein, common terms may be identified in order 
to optimize the retrieval of information in cases where user 
queries employ such terms. 

A series of steps may be performed as pre-processing 
operations in order to classify and establish query result sets 
for common queries. Referring to a flow chart 31 in FIG. 42,
at a step 32 common terms may be identified prior to system
initialization. Designation of common terms may be performed
based on a number of different factors. For example,
a single word might in theory be designated a common term,
if it appears with a high frequency in result sets obtained by
users. It is noted that a single word common term may offer
relatively little benefit in search efficiency, because the term
lists 836 already permit searching based on individual terms.
Alternatively, common terms might consist of multiple word
combinations of any length, whether bi-grams, tri-grams, or
n-grams. Thus, words that co-occur in high frequency can be
designated as common terms, such as in a bi-gram format.
For example, the bi-gram "Boston-restaurant" might be
designated a common term.

Next, at a step 33, terms may be linked to specific
contexts; that is, terms may be designated or classified as
common terms in part according to their context. For
example, the term "Boston," might be considered a common
term if entered in the "city" field, but it might not be
considered a common term if entered in a "business name"
field or a "category" field. Similarly, the term "restaurant"
might be a common term in the "category" field, but would
not be considered a common term in the "city" field. Thus,
at the step 33, the common term sets may be structured to
reflect context. Thus, the bi-gram "Boston-Restaurant"
might be stored as an expanded form that reflects both the
term and the context in which it is to be treated as a common
term, for example "City=Boston; Category=Restaurant."

Referring to FIG. 42, it may be desirable to expand, at a
step 35, the terms that are to be designated as common
terms. Thus, each term might be expanded to include both
synonyms for the term and other terms that are semantically
related to the common term in the established context for the
term. For example, the common term "category=restaurant"
might be expanded to cover results in which synonyms for
restaurant are included in the results, such as "diner," "bar
and grill," "eatery" and the like. Similarly, a city term might
be expanded to include suburbs or neighborhoods; thus, the
term "City=New York" would be expanded to include
"City=Brooklyn," "City=Queens," and "City=Manhattan."
Note that the synonyms for a given term might be different
depending on the context. For example, the term "Dorchester"
might be a related term for "City=Boston," but it might
not be a related term for "business name=Boston."

The pre-processing steps 32, 33 and 35 might be accomplished
in a different order, and other steps might be
included in embodiments of the invention. Once common
terms are identified, linked to contexts, and expanded at the
pre-processing steps 32, 33 and 35, it is possible to establish
lists or identifiers at a step 46 that include the expanded
common term n-grams. One way of dealing with common
term combinations would be to generate in advance term
lists 836 that are predicted to be used with some frequency
(e.g., restaurants, Boston, New York, etc.) and to
pre-calculate the intersection of the likely combinations. This
approach requires substantial processing and would have to
be performed frequently, given frequent changes in the
identifiers. Instead, it is possible, at the step 46, to create
special identifiers, or term lists 836, that represent the
expanded common terms, as linked to their contexts. Thus,
a term list 836 might consist of a linked list of documents,
such as business listings, that contain the terms "Boston"
and "restaurant," (or synonyms thereof) in the contexts in
which those terms are common. The term lists 836 may, like
other term lists 836 described elsewhere herein, further
include information as to the term frequency of each term,
synonym or related term, and the inverse document frequency
of the term, synonym or related term in all documents
in the set. In an embodiment, the synonyms and
related terms may be included in the actual business listings
that are used to generate term lists 836, so that those listings
will be included in the generation of common term lists. In
an embodiment, the listings themselves may be classified as
to common terms and synonyms or related terms of those
terms. Listings may be further classified as to sub-contexts,
depending on the search context. Listings using identical
terms should also be included in term lists, because they use
identical token identifiers for such terms. For example, the
term "Boston" should be understood in a nationwide search
to include listings in both Boston, Mass. and Boston, Ky.,
because the token for the term "Boston" will be the same in
each case. Result sets must be identified as tokenwise
semantically related to the classifications that are possible in
a search. Results are thus classified into common term
groups on a listing-by-listing basis.

At a step 48, the common term lists 836 for combined
terms can be stored in a designated area of the primary
database 812, front end server 804, or server node 808-810
that allows a rapid search in the event common term
combinations are included in the user query. The common
term lists are thus assigned to a special results area for
common term searches.

The steps 46 and 48 may be performed upon initialization
of the system. Thus, with the pre-processing steps 32, 33 and
35 and the initialization steps 46 and 48, result sets are
established for common term searches, and the result sets are
stored in a special location in memory for rapid retrieval.

Next, at a step 49, query rules may be established that
direct appropriate user queries to the special location in
memory established at the step 48. Referring to FIG. 43, the
user might enter a query on a template 34 that is displayed
as a page, such as a markup language page, on the user's
browser 824. The template might include fields 36, such as
a category field 38, a business name field 40, a city field 42
and a state field 44. When the user enters a term into one or
more of the fields 36 and initiates a query, such as by
pressing "enter" on the keyboard or clicking the appropriate
screen location, the query is delivered to the parser 866 of
the server 808 to which that user has been routed. The query
is then used, as described above in connection with FIG. 41,
to retrieve documents. In an embodiment of the invention,
the documents that are retrieved at the step 28 and displayed
at the step 30 of FIG. 41 are a set of matching categories for
the query. For example, as depicted in FIG. 44, if the user
enters the category "art supplies," the information retrieval
software 908 may retrieve a set of matching categories that
relate to art supplies. The retrieved categories may be
ordered alphabetically, by order of significance, or grouped
by sub-categories. The user then may select categories
among the matching categories to receive either further
sub-categories or documents, such as advertisements or
other markup language pages, that correspond to the
categories. In an embodiment, rather than matching categories,
the information retrieval software 908 may immediately
retrieve matching documents, such as specific advertisements
or other markup language pages, rather than categories
of documents. This direct retrieval step may be
accomplished, for example, when one of the user-entered
categories is an exact match to one of the categories included
in the term lists 836.

A similar series of steps takes place if the user enters a
query for a particular location in the city field 42 or the state
field 44, or for a business name in the business name field
40. The information retrieval software 908 retrieves documents
from the term lists 836 that correspond to a ranking
of an expansion of the user-entered query.
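
The pre-processing and initialization steps 32 through 48 can be illustrated with a minimal sketch that builds one combined list per context-linked common term n-gram. Everything named here is hypothetical: the `RELATED` expansion table, the dictionary layout of a listing, and the use of a frozen set of (context, term) pairs as the key to the special storage area are assumptions chosen for the example, not the structures of the term lists 836 themselves.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple

ContextTerm = Tuple[str, str]          # e.g. ("city", "boston")
NGramKey = FrozenSet[ContextTerm]      # order-independent key for an n-gram

# Step 35 (assumed data): context-specific synonyms and related terms.
# "dorchester" is related to the city Boston, not to a business name "Boston".
RELATED: Dict[ContextTerm, Set[str]] = {
    ("category", "restaurant"): {"restaurant", "diner", "eatery"},
    ("city", "boston"): {"boston", "dorchester"},
}


def build_common_term_lists(
    listings: Sequence[Dict[str, object]],
    common_ngrams: Sequence[Tuple[ContextTerm, ...]],
) -> Dict[NGramKey, List[int]]:
    """Steps 46 and 48: pre-compute one result list per common term n-gram."""
    special_area: Dict[NGramKey, List[int]] = defaultdict(list)
    for ngram in common_ngrams:
        key = frozenset(ngram)
        for listing in listings:
            # A listing qualifies only if every (context, term) pair matches
            # the listing's value in that same context, after expansion.
            if all(
                str(listing.get(context, "")).lower()
                in RELATED.get((context, term), {term})
                for context, term in ngram
            ):
                special_area[key].append(int(listing["id"]))
    return dict(special_area)
```

Because the match is made per context, a listing whose business name contains "Boston" but whose city is elsewhere is not placed in the "City=Boston; Category=Restaurant" list.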

When both a category and a location or a business name, 
or all three, are entered by the user, then the information 
retrieval software 908 may, in a conventional manner, 
retrieve term lists 836 that correspond to each of the terms 
of the query, such as a list corresponding to the category
"restaurant" and a list corresponding to the city field
"Boston." The information retrieval software 908 could then
perform an intersection of the two sets and perform a 
ranking of the related categories (e.g., Italian restaurants in 
Boston, French restaurants in Boston, etc.) or related listings 
(for specific Boston restaurants). Because the term list 836 
for documents containing the term "Boston" (including all 
businesses in Boston) and the term list 836 for documents 
containing the term "restaurant" (including all restaurants,
nationwide) are both very large, the processing involved in 
retrieving each list and performing an intersection in order 
to identify matching categories or documents can be sub- 
stantial. Accordingly, it is desirable to reduce the processing 
involved. 
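
To make the cost of the conventional approach concrete, the following sketch intersects two term lists. It assumes, purely for illustration, that each term list is a sorted list of integer document identifiers; the actual layout of the term lists 836 is not specified here. The work is proportional to the combined length of both lists, which is why two very large lists such as "Boston" and "restaurant" are expensive to combine at query time.

```python
from typing import List


def intersect_sorted(a: List[int], b: List[int]) -> List[int]:
    """Return the identifiers present in both sorted lists.

    Runs in O(len(a) + len(b)) time: every element of both lists may have
    to be visited once, regardless of how small the final result is.
    """
    i = j = 0
    common: List[int] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1   # a[i] cannot appear later in b, skip it
        else:
            j += 1   # b[j] cannot appear later in a, skip it
    return common
```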

The information retrieval software 908 may be pro- 
grammed with query rules at the step 49 to recognize when 
a query includes a common term n-gram, such as "City= 
Boston; category=restaurant." That is, whatever common 
terms are identified at the pre-processing steps 32, 33 and 35 
should be recognized by the information retrieval software 
908, so that queries that use the common terms in the 
appropriate contexts (or synonyms or related terms in those 
contexts) are designated for special processing. In particular, 
the information retrieval software 908 may be programmed 
to execute the search for the user's query in the special area 
of memory that was established for storage of the special 
common term lists 836 at the step 48 of FIG. 42. 
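
A query rule of the kind established at the step 49 might be sketched as follows. The sketch is hedged throughout: `CANONICAL`, `SPECIAL_AREA`, `common_term_key` and `route_query` are hypothetical names, the stored document identifiers are invented sample data, and the fallback callable merely stands in for the conventional retrieve-and-intersect path.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

ContextTerm = Tuple[str, str]

# Assumed mapping from context-specific synonyms or related terms back to
# the common term they expand (see the pre-processing step 35).
CANONICAL: Dict[ContextTerm, str] = {
    ("category", "diner"): "restaurant",
    ("category", "eatery"): "restaurant",
    ("city", "dorchester"): "boston",
}

# Assumed contents of the special storage area set up at the step 48.
SPECIAL_AREA: Dict[FrozenSet[ContextTerm], List[int]] = {
    frozenset({("city", "boston"), ("category", "restaurant")}): [101, 102, 205],
}


def common_term_key(query: Dict[str, str]) -> FrozenSet[ContextTerm]:
    """Normalize the filled-in template fields into a context-linked key."""
    pairs = set()
    for field, value in query.items():
        context, term = field.strip().lower(), value.strip().lower()
        if term:
            pairs.add((context, CANONICAL.get((context, term), term)))
    return frozenset(pairs)


def route_query(
    query: Dict[str, str],
    conventional_search: Callable[[Dict[str, str]], List[int]],
) -> Tuple[str, List[int]]:
    """Send common term n-grams to the special area, everything else onward."""
    key = common_term_key(query)
    if key in SPECIAL_AREA:
        return "special", SPECIAL_AREA[key]
    return "conventional", conventional_search(query)
```

For example, a query with "Dorchester" in the city field and "Diner" in the category field normalizes to the same key as "City=Boston; category=restaurant" and is answered from the special area, whereas "Boston" entered in the business name field is not treated as a common term and falls through to the conventional search.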

In one embodiment of the invention, referred to as "CCC- 
indexing," the common terms that are selected for combined 
common term lists and special storage are bi-grams in the
form "City=xxx; category=yyy" and in which the most
common categories, such as restaurants, are found in the 
category field and the largest cities, such as New York, 
Boston, and the like, are found in the city field. 
[Data Integration] 

Referring now to FIG. 45, shown is one embodiment of 
the database included in the BackOffice component as 
included in FIGS. 2 and 4. Generally, data updates included 
in the database come from three different sources in this 
particular embodiment. One source is on-line updates, as 
provided by users making updates or entering new informa- 
tion for business listing via network connections through the 
BackOffice component as through the Front End Server. A 
second source of data updates is based on foreign source 
updates. Generally, foreign source updates are those update 
records which come from a different data source than the 
original existing database. A third type of data integration or 
update source is referred to as a native source update. 
Generally, a native source update is when an updated version 
of the existing database having the same source as the 
existing database is provided. For example, a database copy 
may be provided as an update on a monthly basis using full 
sets of data where a data provider provides an updated 
version of the same data set. The native source data inte- 
gration procedure integrates those changes in the new data 
set into the existing database. This is in contrast to a foreign 
source update, for example, where the existing database is 
provided by one vendor, and the update records for example, 
are provided by a different vendor. Updates that originate
from a foreign source are referred to as foreign source data
integration or foreign source updates.

It should be noted in this particular embodiment that the 
native source update records are provided using full sets of 
data. In other words, the existing database is a complete 
database. The native source updates are provided in the form 
of a complete database as opposed to only providing update 
records. The foreign source update records are generally 
records obtained from a source different from the working 
database and are merged into the existing database. 

Shown in FIG. 45 is a native source update database 1500 
which is integrated into the unfiltered database 1504. 
Generally, this is done by performing comparisons of the 
records of the native source update database 1500 and the 
unfiltered database records 1504 in determining the various 
types of operations that need to be performed to integrate the 
changes from the native source update into the unfiltered 
database. This will be described in more detail in paragraphs 
that follow. Applying data enhancement techniques to the 
unfiltered database, these record changes are integrated into 
the working database 1508. Generally, the unfiltered data- 
base 1504 is a complete version equivalent to the working 
database. However, the records included in the unfiltered 
database 1504 generally include raw data which has not had 
the benefit of the data enhancement techniques as applied to 
the working database records 1508. The on-line update 
records 1506 and the foreign source update records 1510 are 
integrated directly into the working database copy 1508. It 
should generally be noted that the foreign source update 
records 1510 are integrated or merged into the working 
database records 1508 by applying data merging techniques 
that will be described in more detail in paragraphs that 
follow. 

It should also be noted that the denormalized data, as 
included in the BackOffice component and the Front End 
Server, include in this particular embodiment, three tables or 
components of data. Generally, the three components of data 
include a category file, a fact file, and a business listing file. 
The business listing file has been previously described in 
conjunction with the architecture in other sections of this 
description. The fact file includes information additionally 
provided by various advertisers or business services which 
are generally static in nature. For example, the fact file may 
contain information such as hours of operation and extra 
attributes such as brand names or products produced by a 
business. This file generally does not change with updates. 
The third file, a category file, may include a category
identifier and a corresponding heading. Generally, the cat- 
egory identifier is a numeric quantity or other identifier that 
may be used in performing queries. The heading is a textual 
description of the various category identifiers which may be 
used either for performing data queries. In the various data 
integration and updates, as will be described in paragraphs 
that follow, it should be noted that the business listing file is 
generally what is updated when considering the techniques 
which will be described. However, the category file is also 
updated as part of the native source update, as will also be 
described in paragraphs that follow. 

The paragraphs that follow describe general integration
techniques for the foregoing types of data updates. Each of 
these techniques which will be described is associated with 
one type of data integration. However, in other preferred 
embodiments, each technique may be associated with and 
applied to other data types. 

The foreign source update will be described in paragraphs 
that follow. However, the concepts and techniques included 
herein may also be applied to different types of data updates. 
Generally, in the description that follows for data entries,
there is one existing record or data entry per business listing.
In this particular embodiment, a business listing is the
atomic unit of granularity by which updates are performed.
Any information and data such as a phone number, name and
address associated with a particular business entity is
considered to be part of one logical piece of information or
record. Thus, in the descriptions that follow, updates are
made with regard to the information associated with one
particular business listing or entity.

The techniques which will be described regarding the
foreign source update generally assume that an existing
database and update records are provided, and that each
originate from different or foreign sources. It should generally
be noted that since the sources are different, there is no
general assumption made as to particular data fields or the
structure of the foreign records as compared to the existing
database. It is first determined whether there is a matching
entry in the existing database for an entry in the updated
version of the database. If no match is found in the existing
database for an entry or business listing which appears in the
updated version of the database, this new entry is added and
integrated into the existing database. The techniques which
will be described in paragraphs that follow may be
adaptable, as known to those skilled in the art, to update
situations in which an implementation uses something other
than two complete sets of data when performing a system
update.

In this embodiment, this process of foreign source update
is performed in the BackOffice component 818 in which the
existing database to be updated is generally in normalized
form. The updated version of the database may be in
normalized or denormalized form. Depending on the form,
additional processing steps, as known to those skilled in the
art, may be needed to retrieve and update the actual files that
include the data, for example, associated with a particular
business entity or record. In the description below, the
described technique assumes that each business listing
generally includes the following data items: business name, zip
code, and at least one of a primary phone number or toll-free
phone number. Generally, the foreign source integration
technique is based on the premise that a phone number and
zip code of a business are sufficiently unique to significantly
reduce the matching problem to comparisons of a few
listings.

In paragraphs that follow, a determination is
being made as to whether entries in the update and existing
database match to further determine if update records are to
be added, or if existing database records are to be deleted or
modified.

Generally, the matching technique described for foreign
source update determines a correspondence between the
foreign source update records 1510 and the records in the
existing working database 1508. The matching technique
generally includes: 1) determining which records in the
existing working database match which update records; 2) if
more than one record in the existing database corresponds to
the same record in the existing working database, determining
which record in the existing database is the closest match
for the update record; and 3) if the foreign source update
records include duplicate records such that multiple update
records correspond to the same set of one or more existing
database records, collapsing the duplicate foreign source
update records into a single update record that is matched to
a single record in the existing database.

After determining which records in the foreign source
update correspond to which records in the existing working
database, operations are determined and applied to the
existing working database. Generally, as will be described,
transactions with respect to the existing working database
are determined. Generally, an update to an existing record is
performed so as not to lose any existing information while
also incorporating the new additional information or updated
information. For example, an existing listing includes a
business name and address, and phone number, but no
e-mail address. A foreign source update record includes a
business name and address, e-mail address, and phone
number. The information from the foreign source update
record is included in the existing database in union with the
fields that are blank in the update record such that the e-mail
address in the existing database is not removed when the
updated information from the update record is applied. It
should be noted that in this embodiment, no delete operations
are performed with the foreign source update data
integration due to the nature of combining data originating
from different sources. However, other embodiments may
include delete operations in addition to update and modify
operations in foreign source data integration.

Referring to FIG. 46, at step 1000 a comparison is made
between the phone number of an update record and the
phone number field of each entry in the existing database. At
step 1000, a determination is made as to whether or not the
record in the latest version of the database copy is an 800
phone number. If a determination is made at step 1000 that
the phone number of the current update entry is not an 800
number, control proceeds to step 1008. At step 1008, the
procedure "match phone number" is performed to produce a
subset of one or more entries of the existing database which
match the existing phone number. Control proceeds to step
1010 where the procedure "name match" is performed.
Generally, "name match" will be described in paragraphs
that follow to determine whether there is a business name
match for a particular entry. Control proceeds to step 1012
where "derive score" is performed based on the zip code and
the name match score. Generally, the result of step 1012
produces a score representing a statistic relative to determining
whether two entries in a particular database and an
updated version of the database match.

After performing step 1012, control proceeds to step 1020
of FIG. 47 where a comparison or a determination is made
as to whether or not the derived score is greater than 50%.
If the derived score is greater than 50%, control proceeds to
step 1034 where a determination is made whether there is
only one matching entry in the database for an update
record. If a determination is made at step 1034 that there is
only one matching entry in the database, control proceeds to step
1042, where a determination is made that a match has been
found. Alternatively, if at step 1034 there is more than one
matching entry in the database for a record in the current
updated version of the database, control proceeds to step
1036, where a determination is made whether there is only
one entry with a maximum score. If there is only one entry
with a maximum score, control proceeds to step 1046, where
this maximum scoring entry in the existing database is
determined to be the matching entry for the updated version.
If at step 1036 there are multiple entries with the same
maximum score, control proceeds to step 1038 where additional
processing is required to determine which is the
matching entry, if any.

It should generally be noted that the score threshold of
50% may be tuned and varied for each particular implementation
and embodiment. This value is generally a configurable
threshold value that may be defined heuristically, for
example, by examining data samples.
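
The score comparison of steps 1020 through 1046 of FIG. 47 can be sketched as a selection function over the scored candidate entries. This is a simplified illustration under stated assumptions: scores are taken to be fractions between 0 and 1, `select_match` and its status strings are hypothetical names, and candidates at or below the threshold are simply set aside here rather than being passed on to the name length and name edit distance checks.

```python
from typing import List, Optional, Tuple

# Configurable threshold; the 50% value may be tuned per implementation.
SCORE_THRESHOLD = 0.50


def select_match(
    scored: List[Tuple[str, float]],
    threshold: float = SCORE_THRESHOLD,
) -> Tuple[str, Optional[str]]:
    """Pick the existing-database entry that matches one update record.

    `scored` holds (record_id, derived_score) pairs for the candidate subset.
    Returns a (status, record_id) pair; record_id is None when no single
    entry can be chosen automatically.
    """
    # Step 1020: keep only candidates whose score exceeds the threshold.
    passing = [(rid, score) for rid, score in scored if score > threshold]
    if not passing:
        # Would continue with the name length comparison in the full flow.
        return "below_threshold", None

    # Steps 1034 and 1042: a single passing candidate is the match.
    if len(passing) == 1:
        return "match", passing[0][0]

    # Steps 1036 and 1046: otherwise prefer a unique maximum score.
    best = max(score for _, score in passing)
    top = [rid for rid, score in passing if score == best]
    if len(top) == 1:
        return "match", top[0]

    # Step 1038: tied maximum scores need additional processing.
    return "needs_review", None
```

Keeping the threshold as a parameter reflects the note that the 50% value is a configurable, heuristically chosen setting.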
The processing of step 1038 is generally performed
off-line. It may be done manually or in an automated fashion
in accordance with the types of data in the existing database.
For example, at step 1038, having multiple entries with the
same maximum score may indicate that there is an error or
corruption in data. For example in one embodiment, an
alternate technique is used where, if any record has the same
maximum score, that record is considered as being a matching
record.

If at step 1020 a determination is made that the score is
less than or equal to 50%, control proceeds to step 1022. At
step 1022, a determination is made as to whether or not the
difference in the name length is less than or equal to three.
If the difference in the name length field is not less than or
equal to three, control proceeds to step 1028 where a
determination is made that no matching entry exists in the
database. It should generally be noted that the decision
process and the comparison process performed in steps 1020
and 1022 are performed for each matching entry in the
subset as produced from step 1008. It should generally be
noted that the threshold length of three for the name length
used in step 1022 may be varied and tuned for each
particular embodiment and implementation.

At step 1022, if a determination is made that there is at
least one entry in the existing database with a name length
difference less than or equal to three, control proceeds to step
1024, where the name edit distance heuristic may be used to
compute the name distance. Generally, the name edit distance
is the minimum number of insertions, deletions, and
substitutions at the character level to turn one name entry or
string into a second name entry or string. The number of
states that string A must pass through to be transformed into
string B is an entry or quantity referred to herein as the name
edit distance. For example, the textbook entitled "Text
Algorithms", by Maxime Crochemore and Wojciech Rytter
generally describes a technique for the name edit distance
heuristic.

At step 1024, the name edit distance is computed, for
example, using dynamic programming techniques known to
those skilled in the art, such as using a finite state machine,
for each matching entry as in the subset produced by step
1008. At step 1026, if a determination is made that there are
one or more entries with a distance less than 10% of the
length of the update name string, then control proceeds to
step 1100 of FIG. 52 where a determination is made at step
1100 as to whether or not there is only one matching entry
in the subset as derived from the step 1008.

Referring now to FIG. 52, if a determination is made at
step 1100 that there is only one matching entry, control
proceeds to step 1112, where a determination is made that a
matching entry has been found. If at step 1100 a determination
is made that there is more than one matching entry in
the existing database for a foreign source update record,
control proceeds to step 1102, where a determination is
made as to whether or not there is only one matching entry
with a minimum distance. If a determination is made that
there is only one matching entry with a minimum
distance, control proceeds to step 1108 where it is determined
that an entry in the existing database with the minimum
distance is considered a match to the update record in
the foreign source update. If at step 1102 a determination is
made that there is more than one matching entry with a
minimum distance, control proceeds to step 1104 where
additional processing may be required in accordance with
the types of data included in the database. The additional
processing required is generally the same types of processing
that may be performed in accordance with the previously
described step 1038 of FIG. 47.

Referring back to FIG. 46, if at step 1000 a determination
is made that the phone number of the updated record is an
800 phone number, control proceeds to step 1002 where a
determination is made as to whether or not the phone
number, including the area code, and the zip code match one
or more entries in the existing database. At step 1002, if
there is a determination that one or more entries in the
existing database match the phone number and zip code of
the update record, control proceeds to step 1006 where a
subset of one or more matching entries is found. Control
proceeds to point B indicated at step 1010 in FIG. 46
where execution continues.

If a determination is made at step 1002 that the phone
number and zip code do not match any entries in the existing
database, a determination is made at step 1004 that no match
exists in the database for the current update record.

Referring now to FIG. 48, shown is a flow chart of an
embodiment for the "match phone number" routine as
performed at step 1008. At step 1050, a table is used with old
and new area codes and exchanges to determine if there are
one or more matching entries in the existing database which
match the phone number of the current update entry.
Generally, the processing step of 1050 and the decision
made at step 1052 may be used, for example, where area
codes have changed due to the increased volume of phone
numbers which require additional area codes to a particular
locality to be added. For example, the 508 area code may be
expanded to include the 781 area code. Thus, an existing
phone number may be included in the database with either
the 781 or the 508 area code depending on the age of the data
in the database. If a determination is made at step 1052 that
either an old area code and exchange, or a new area code and
exchange match, control proceeds to step 1054 where a
subset of one or more matching entries is formed. Control
proceeds to step 1056 where control returns to the calling
procedure. In this instance, control returns to step 1008
where subsequent control proceeds to step 1010 of FIG. 46.

If at step 1052 a determination is made that there is no old
or new area code and exchange in the existing database
which match the current entry in the updated version of the
database, control proceeds to node C of the "secondary
search" in FIG. 51 at step 1086. Generally, the processing
which occurs in the steps of FIG. 51 attempts to find semantic
equivalents of the name fields indicating a possible match.
At step 1086, the name of the update record is tokenized. At
step 1088, "stop words" are removed from the name field.
Generally, stop words may be words which may be ignored
when doing a name comparison. For example, in this
particular embodiment, the words "and", "or", "the", "a",
"an", "to", "in", and "at" are considered "stop words" for
which a matching entry may contain any number or combination
of these and the match should still succeed. Thus,
at step 1088, these words are removed and not considered
when performing a name comparison.

At step 1090, a search of the existing database is performed
on the conjunction of the tokenized name field
components and the zip code. Generally, the search is being
performed for entries in the existing database which match
zip code and the different components of the name field. At
step 1092, a determination is made as to whether or not there
are more than five matching entries in the existing database for
the current update record. If at step 1092 a determination is
made that there are more than five matching entries in the
existing database, control proceeds to step 1094 where a
determination is made that no match has been found. If at
step 1092, a determination is made that there are not more
than five matching entries, control proceeds to point B in the
processing which is shown in FIG. 46, step 1010 where these . The update techniques for native source assumes that two

name matching entries are used as the subset upon which full sets of data are used— the updated database version, and 

subsequent processing is performed. an unfiltered or raw version 1504 of the existing working 

Referring now to FIG. 49, shown is a flow chart of the database. Generally, the techniques that are described below 

steps of one embodiment performing a "name match" as part 5 with regard to native source processing are data enhance- 

of a routine processing as ^invoked from step 1010 of FIG. men t techniques applied to the unfiltered database 1504 to 

^GeneraUy, the steps of FIG. 49 attempt to perform and ^ me workm daUbase 150g of nG 45 

find semantic equivalents of the names of a business in this d„g™w „™, # cn- m * . 1 ^ .u 

particulars^ nf^H^°Z ' V*? ^ ^ ™^° a f 

formed by step 1008, {he name entries are canonized. in ° f ** f dala * P«fij?™» «J«* ^ complete sets of 

Generally, canonization rules are a set of transformations 10 ******* natlve sourc f* G 5«erally, at step 1400, the latest 

which occur, for example, transforming abbreviations and M of data received ^ ch as from a data provider is submitted 

the like to semantic equivalents allowing for a common mto lhc databasc and compared against the set that is in the 

denominator of terms to be searched for. For example, if all existing database. All of the records in the data set are loaded 

entries in a database use the entire work "incorporated" to in ^ following form. For comparison purposes, in the steps 

indicate an incorporated business, then if a name entry 15 mat follow there is a distinct record ID followed by a string 

includes the abbreviation "inc", this is expanded to the full where the string is all the fields from the record concatenated 

name "incorporated" prior to being compared. Generally, the together for comparison purposes in steps that follow. In this 

precise canonization rules or transformations depend upon particular instance record I.D.s are unique against the set and 

the particular data being examined in a particular applica- indexed. As a result of processing at step 1400, the delta or 

tion. 20 difference between the two data sets is produced. Each entry 

Control proceeds to step 1062 where the name field is in this delta or difference is classified as an insert, delete, or 

tokenized into components. At step 1064, a setwise contents update operation. A record is inserted into the existing 

comparison of the name components of each entry is deter- database in which identifiers are in the new version of the 

mined against the current update entry. At step 1066, a score data set but not in the existing database. All records which 

is computed for each name comparison of the existing 25 have identifiers in the existing database, but not in the new 

database entry with a record of the updated version of the version, are slated for deletion from the existing database, 

database. The score is computed as one point per matching Records in which identifiers are in both sets, but, however 

component. At step 1068, control returns to step 1010 where have associated strings that differ are considered update 

subsequent processing resumes with step 1012. records having data contents in the string that is updated for 

Generally, the processing steps of FIG. 49 attempt to 30 the corresponding identifiers. At step 1402, the update 

formulate a numeric quantity or metric for determining records which include inserts and update transactions are 

whether two name entries match. This weighted value or applied to the existing database. At step 1404, certain data 

concatenation is used in further comparison in combination post processing is performed as will be described further in 

with other field, such as the zip code, and arriving at a final the paragraphs that follow. 

quantity in determining whether or not name fields of an 35 FIGS. 46-54 generally describe data integration of the 

existing database entry and an update record match. native source updates which are applied to the database of 

Referring now to FIG. 50, shown as a flow chart of the business listings and categories. In summary, for both busi- 

steps of one embodiment for performing the routine "derive t ness listings and categories, comparisons are made between 

score", as performed from step 1012 of FIG, 46. Generally, records of the native source unfiltered database and native 

derive score attempts to produce normalized metric or score 40 source update. 

based on the name field and the zip code. At step 1080, the Referring now to FIG. 54, shown are more detailed steps 

score previously derived from name match for each entry is of one embodiment of step 1400 involving the computation 

updated by one if the zip codes of an existing database entry of the data update as pertaining to the native source business 

match an updated entry. At step 1082 this score is normal- listings previously described. At step 1406 a comparison is 

ized by taking the score computed thus far and dividing it by 45 made between the existing database copy with the updated 

the number of tokens in the foreign source entry name field. database copy by comparing the record identifiers and the 

It should be noted that other techniques may be used to string concatenation which represents the remainder of the 

produced a normalized score as in step 1082. At step 1084, records. At step 1410 each update record is classified as one 

control returns to the point of call. In this particular instance, of a matching entry, an insertion, a deletion, or an update 

control returns to step 1012 where processing resumes with 50 with respect to the existing database. At step 1416, a record 

step 1020 of FIG. 47. is determined to be matching if the record identifier and 

Just described with regard to FIGS. 46 through 52 are string field in the existing and updated data base copies 

processing techniques for determining matching entries for match. 

foreign data. What will now be described are techniques At step 1420, a record has been classified as one to be 
which provide for data enhancements where the two data- 55 inserted if there is a record with a record identifier in the 
bases or two data sources being integrated are from the same update database which is not in the existing database, 
source. Generally, where there is this native source Subsequently, at step 1418, data enhancements are per- 
processing, there wfll be fewer differences between the data formed and the record is integrated into the working data- 
entries due to the fact that both data sets come from the same base. It should be noted that the data enhancements also 
source. Thus, the techniques which are described in para- 60 performed in step 1428 is described in more detail in 
graphs that follow may generally be referred to as data paragraphs that follow. 

enhancements. However, similar to the processing just At step 1424, a record has been classified as one to be 

described with regard to foreign source integration and deleted from the existing database if there is a record with 

processing, the concepts and processing steps which will be the record identifier in the existing database not in the 

described may be readily adaptable to other types of data 65 updated database. Subsequently, at step 1422, the data 

updates in accordance with other particular implementation operation is performed integrating the data updates into the 

and data sets. existing working database. 
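The classification of steps 1406 through 1430 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function names `concat_fields` and `classify_delta`, the dictionary layout, and the `|` separator are assumptions made for illustration only. Each data set is modeled as a mapping from record identifier to the concatenated string of the remaining fields.

```python
def concat_fields(record, sep="|"):
    """Concatenate all fields of a record into one comparison string.

    Fields are joined in sorted key order so that two records with the
    same contents always produce the same string (an assumption; the
    source only says the fields are concatenated).
    """
    return sep.join(str(record[key]) for key in sorted(record))


def classify_delta(existing, updated):
    """Classify every record identifier as match, insert, delete, or update.

    `existing` and `updated` map record identifier -> concatenated string.
    """
    delta = {"match": [], "insert": [], "delete": [], "update": []}
    for record_id, fields in updated.items():
        if record_id not in existing:
            delta["insert"].append(record_id)    # only in the updated copy
        elif existing[record_id] == fields:
            delta["match"].append(record_id)     # identifier and string agree
        else:
            delta["update"].append(record_id)    # same identifier, new contents
    for record_id in existing:
        if record_id not in updated:
            delta["delete"].append(record_id)    # only in the existing copy
    return delta


existing = {1: "a|x", 2: "b|y", 3: "c|z"}
updated = {1: "a|x", 2: "b|Y", 4: "d|w"}
print(classify_delta(existing, updated))
```

Because identifiers are unique and indexed, each lookup is a constant-time dictionary access here; a database implementation would presumably rely on an index over the identifier column for the same effect.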
At step 1430, a record is considered an update transaction to an
existing record in the existing database if the record identifiers
match, but the remainder of the record represented as a string does not
match. Subsequently, at step 1426, the longitude and latitude of a
record may be updated if the address has been modified. At step 1428,
data enhancements may be performed to the record and the data update is
[...].

In the case of step 1416 where matching entries are found, no further
processing may be required for the existing database or the updated
database record. However, at steps 1420, 1424, and 1430, update records
or transactions are generated to modify the existing database. It
should generally be noted that any of the foregoing operations which
are modifications, including updates and deletions, to the existing
working database records may be conditionally performed in an
embodiment of the invention. A protection or locking technique may be
included in the database, for example, which prevents a deletion or
modification of a particular business listing included in the database
regardless of the processing classifications of FIG. 54.

The data enhancements, as performed at steps 1418 and 1428, are
generally data filtering steps prior to integrating the data update
into the working database 1508. The data filtering techniques generally
facilitate matching corresponding records when performing updates. Data
enhancements may include, for example, upper/lower case justification,
detection of synonyms and/or acronyms, transformation of abbreviations
as may be used in business names (e.g., corp., inc.), street addresses
(e.g., st., pl.), and city and state names. Other embodiments may
include other enhancements in accordance with the type of data and the
various applications.

Referring now to FIG. 55, shown is an embodiment of a method for
performing update computation of step 1400 as applied to the category
file. Recall that the category file in one embodiment includes a
category identifier and a corresponding header that is a text
description of the associated category identifier. It should generally
be noted that these updates are applied in a model similar to that of
the business listing files for native source updates. The updates are
first applied to a "raw" or unfiltered version of the category file,
followed by data enhancements as appropriate, and then integration of
the data updates into a working copy of the category file included in
the working database 1508.

At step 1460, the current and updated category files are compared in
terms of identifiers and associated headers. At step 1462, each update
record is classified as one of several types of transactions.

At step 1464, a record in the updated category file is considered
matching if the record identifier and the associated header match an
entry in the current category file.

At step 1466, a record is inserted into the existing unfiltered
database and working database if the record identifier is not in the
existing unfiltered database copy of the categories. At step 1468, data
enhancements may be performed and the resulting filtered data further
integrated into the existing category file in the working database
1508. The data enhancements, as included in steps 1468 and 1476, are
described in more detail in paragraphs that follow.

At step 1470, a record in the existing category file is deleted if the
record identifier of an existing record is not in the updated version.
At step 1472, this deletion operation may be performed to the working
copy of categories included in the working database 1508.

At step 1474, an update record is used to update the database copies if
the record identifiers of an existing record and an update record
match, but the heading names differ. At step 1476, data enhancements
are performed and the update operation is integrated into the working
copy of the categories included in working database 1508.

The data enhancements, as performed at steps 1468 and 1476, [...]
include processing of the headings. For example, the processing to
enhance the text of the headings may include text transformations such
as upper/lower case justification, consolidation of abbreviations, and
removal of idiosyncratic and slang terminology. The function of these
data enhancements is to generally filter the data to provide more
accurate determination of matching or corresponding categories.

Referring now to FIG. 56, shown are general post-processing steps for
one embodiment, expanding step 1404 of FIG. 53 into more detailed
steps. Generally, these steps may be performed to the category file as
included in the working database 1508.

At step 1440, new categories may be added. Generally, a data vendor may
not provide an integrated version of all business categories. It may be
possible to enhance some record categories as additional data is added.
For example, a restaurant may be a particular type of category and
there may be other subdata organized in the structure of the record
indicating that there is a particular type of restaurant in accordance
with the various ethnic cuisines, such as French or Italian.
Post-processing as in step 1440 may be written to search the data file
in accordance with a recognized structural format and add additional
categories in accordance with any categories and subcategories. For
example, if a determination is made that there is a large number of
restaurants with a subcategory of French, a new record category may be
added which is "French restaurant". Similarly, an Italian restaurant
category may be added. This is generally performed in accordance with
the data organization and categories of the particular data being
examined in each implementation.

At step 1442, redundant categories as stored by business are detected
and collapsed by removing the equivalent categories. Generally, at step
1442, semantically equivalent categories are determined. Generally,
this includes locating equivalent categories for which the spelling
might be slightly different, or those fields which may be subsets or
equivalents of other fields. For example, "animal doctor" may be
interpreted as a semantic equivalent for "vet", or "veterinarian".
Generally, this step may be done in an automated fashion using any
programming language which is commercially available and may be used
with the existing database. The technique involves dropping or not
including special non-alphanumeric characters or other words, similar
to the stop words. White space may be compressed and comparison may be
done in a case-insensitive manner. The comparison may further be done
by requiring an exact character match or with some at-a-distance
technique similar to those previously described with other data
processing.

At step 1444, the duplicate categories and records may be removed from
the existing version as stored in the working database 1508.

It should be noted that in general, in the processing of step 1442
where there is a collapse of redundant categories by detecting and
removing equivalent categories, different rules may be used to decide
which category of several duplicates is identified as the one to keep.
For example, the one kept may be the one with the longest name, the
shortest name, or simply the first name.

Referring now to FIG. 57, shown is a flowchart of one embodiment of a
method of more detailed processing steps of step 1442 for collapsing
redundant categories. At step
1520, duplicate categories are determined. A technique for determining
duplicate categories is described in paragraphs that follow in
conjunction with FIG. 58. At step 1530, duplicate categories in the
unfiltered database may be examined as a group and one of the category
names or headings is chosen to be the heading included in the collapsed
category record. One technique for choosing the heading is determining
which category name is most frequently used, such as by examining the
business listing files for frequency determination. At step 1534, the
business listing files, as included in the unfiltered database, may be
patched with the new heading and identifier corresponding to the
collapsed resulting record. At step 1536, the category file is also
updated to reflect the collapsed entry. It should be noted that these
changes are made to the existing working database.

Referring now to FIG. 58, shown is a flowchart of an embodiment of
method steps for detecting duplicates in the category file. Generally,
these steps are more detailed processing steps of step 1520 of FIG. 57.
At step 1500, a first category name in the category file of the
unfiltered database is tokenized. In other words, each word included in
the heading or category name is associated with a token. Similarly, in
step 1504, the next record of a category is examined and also
tokenized. At step 1506, a comparison of the two tokenized names is
performed to derive a score in accordance with the number of matching
name components. This may also be normalized, as described in
accordance with the foreign source update processing techniques. At
step 1508, a determination is made as to whether or not the score is
greater than a predetermined threshold. In this instance, the threshold
is 75%. If the score is greater than the threshold, control proceeds to
step 1512 where the categories are tagged as duplicates, propagating
any previous matching identifier tag. In other words, a transitive
matching technique is used in marking matching categories. For example,
if ID1=ID2 and it is then determined that ID2=ID5, ID5 is also marked
as having ID1 as a matching identifier. Similarly, subsequent matches
to ID5 further propagate the value ID1. Subsequently, control proceeds
to step 1510 for advancement to the next record. If it is determined at
step 1508 that the score is not greater than the threshold, no match is
found and control proceeds to step 1510 where the next category is
advanced to. At step 1514, a determination is made as to whether all
the categories have been processed in the category file. If they have,
control proceeds to step 1516 where processing stops. Otherwise,
control proceeds to step 1504 for further comparisons and
determinations of equivalent categories.

It should generally be noted that various percentages and lengths used
in the foregoing data integration techniques may be tuned or varied for
each particular embodiment in accordance with, for example, the data
type and record lengths. Adaptive tuning of values used in making
determinations may be automated, for example, by adjusting thresholds
in accordance with actual data values to filter out extreme data
values.

It should also be noted that the category table or file may be used by
the query engine when processing a data query. For example, the
category file may be used to identify valid categories specified in a
user query. It may also be used to categorize information displayed to
a user. In other words, a resulting data set may be partitioned in
accordance with the categories as included in business listings for the
resulting query. For example, if a resulting data set includes 10
listings, these listings may be categorized or grouped in accordance
with whether or not particular categories are associated with each
listing. The information displayed to the user for these 10 listings
may be 5 listings included in category A, and 5 listings included in
category B. Thus, when the category table or file is updated, the table
is propagated as part of the update data to the Front End Server and,
subsequently, further to the query engine.

[Multi-media Data Transfer]

An efficient data transfer technique is used to transfer data between
databases, such as between the BackOffice component 818 and the Primary
Database 812 of FIG. 4. In this particular embodiment, the types of
data that are transferred generally relate to advertisements such as
those displayed to the user 800 of FIG. 2. Generally, advertisement
data includes text data and non-text data. The non-text data may be
referred to as "blob" data which includes, for example, image and audio
data, as well as machine-executable programs, JAVA bytecode, and the
like. The technique, which will be described in paragraphs that follow,
generally uses different data channels depending on the type of data.
For example, text data is transferred from the BackOffice component to
the Front End Server 804 using a different data channel than blob data
that is also transferred between the two components. A sending
component may be located within the BackOffice component 818 which
includes software that decides the type of data, the channel used to
transfer the data, and how to break up the data into portions which are
transferred to a receiving component located in the Front End Server
804, such as the Primary Database 812. Located on the receiving
component, as may be included in the Primary Database 812, is software
which decides how to synchronize or assemble data received from the
BackOffice component 818. In this particular embodiment, the
advertisement data is generally data that is displayed in response to a
user query.

Generally, the text data included in this data transfer may be
characterized as structured data, as included in text which is
displayed to the user. The second type of data generally transferred is
denoted as "blob" data which is generally not able to be decomposed or
operated upon in different portions. For example, blob data may include
a machine-executable program which is generally of a binary data type.
Generally, the technique uses two separate data channels in which each
channel transfers a different type of data. In this particular
embodiment, one data channel is used to transfer the text data, and
Database Link™ software, as included in the commercially available
Oracle™ database, is used to facilitate database communication of text
data. Therefore the database routines, such as those included in the
Database Link software, may be used in transferring text data between
databases. In this particular embodiment, the Oracle database does not
support direct non-text manipulation, such as for transferring data of
different types, such as blob data. Therefore, a second different data
channel is used to transfer the blob data from one database to another
in which the second channel is external to the database since the
version of the Oracle database software used in this embodiment does
not provide the needed support for direct non-text data manipulation.
The blob data, which may also generally be characterized as multi-media
data, is transferred asynchronously from the text data between
databases.

As will be described in paragraphs that follow, the blob data in this
embodiment is copied from one database to another using a C++ program
with calls to vendor-supplied library routines. This is in contrast to
the text data transfer which is done by a separate data channel, and
the software used performs remote database copies as if they were
local. In this embodiment, the text data transfer may be performed by
calls to the Oracle procedures executed under the control
of the Oracle database software. Generally, the data channels
used to transfer both the text and the blob or multi-media 
data may be network connections between the databases. 
Other types of connections between the databases may also 
be possible, such as a dedicated hard line to facilitate 
database communication, as known to those skilled in the 
art. As will be described in paragraphs that follow, data is 
organized and associated with a particular advertisement 
that may be displayed to a user. 
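The channel selection performed by the sending component can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the embodiment's C++ and Oracle code: the channel labels, the function names `choose_channel` and `plan_transfer`, and the rule that any `bytes` value counts as blob data are all hypothetical.

```python
# Hypothetical labels for the two channels described in the text.
TEXT_CHANNEL = "database-link"      # text data, copied by database routines
BLOB_CHANNEL = "external-process"   # blob data, copied outside the database


def choose_channel(value):
    """Pick a channel from the type of a single field value.

    Assumption: binary values (bytes or bytearray) stand in for blob
    data such as images, audio, or executable code; everything else is
    treated as text.
    """
    if isinstance(value, (bytes, bytearray)):
        return BLOB_CHANNEL
    return TEXT_CHANNEL


def plan_transfer(row):
    """Group the field names of one advertisement row by channel."""
    plan = {TEXT_CHANNEL: [], BLOB_CHANNEL: []}
    for field_name, value in row.items():
        plan[choose_channel(value)].append(field_name)
    return plan


row = {"title": "Pizza Palace", "logo": b"\x89PNG..."}
print(plan_transfer(row))
```

Because the two channels run asynchronously, a receiver built along these lines would still need some way to pair each blob with its text row once both have arrived.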

FIG. 59 is a block diagram of two tables in a preferred 
embodiment depicting one technique for storing the adver- 
tisement data. In this particular embodiment, the advertise- 
ment data and the relation between the different components 
of the advertisement data are described in two tables stored 
in the sending databases. Table 1200 is a relational mapping 
table which generally describes the relation between the 
various data entities as included in a particular advertise- 
ment page. In this particular embodiment, as will be 
described in an example, the relational mapping data 
describes a parent/child relationship between various data 
entities of an advertisement page forming a tree-like struc- 
ture. The data table 1220 includes the actual data as 
described by the relational mapping table 1200. The data 
included in the data table 1220 includes a variety of data 
types as may be displayed with regard to an advertisement. 
For example, the data included in table 1220 may be text 
data, machine executable code, or a JAVA program. In this 
particular embodiment which uses the Oracle database 
software, one restriction is that each row of the data table 
1220 may contain at most one field of blob data. Thus, if an 
advertisement, in this particular embodiment, requires the 
use of multiple blob files, they must be stored in different 
rows of the data table 1220. Other implementations and 
embodiments may have similar or other restrictions that may 
affect the particular organization of the data as required for
advertisements or other data displayed to the user. It should
generally be noted that the structure of the tables depicted in
FIG. 59 is particular to this implementation and embodi-
ment of the invention. Other embodiments of the invention
may include different table structures in accordance with 
various implementation restrictions. 

The relational mapping table 1200 includes two columns 
of data. The first column 1204 is the record ID of the child 
data entity. The second column 1206 is the record ID of the 
parent data entity. The data table 1220 generally includes 
multiple columns depending on how many data fields are 
required for a particular implementation. In this particular 
embodiment, a record identifier 1208 is used to uniquely 
identify a particular data entity in a table. Also included are 
data fields data-1 1210 through data-n 1214 in which each of 
these data fields includes one particular type of data entity as 
may be displayed to the user in response to a data query. 
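The two tables and the one-blob-per-row restriction can be modeled with a short Python sketch. The class names `MappingEntry` and `DataRow`, the field names, and the use of `bytes` to stand for blob data are assumptions for illustration; the source specifies only the columns and the restriction.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass(frozen=True)
class MappingEntry:
    """One row of the relational mapping table 1200."""
    child_id: int                # column 1204: record ID of the child entity
    parent_id: Optional[int]     # column 1206: record ID of the parent entity
                                 # (None for a root is an assumption)


@dataclass
class DataRow:
    """One row of the data table 1220: an identifier plus data-1..data-n."""
    record_id: int
    fields: Dict[str, Union[str, bytes]]

    def __post_init__(self):
        # Enforce the embodiment's restriction: at most one blob per row.
        blob_fields = [name for name, value in self.fields.items()
                       if isinstance(value, (bytes, bytearray))]
        if len(blob_fields) > 1:
            raise ValueError(
                f"row {self.record_id} has {len(blob_fields)} blob fields; "
                "store additional blobs in separate rows"
            )


# An advertisement needing two blobs is split across two rows.
rows = [
    DataRow(105, {"data-1": "caption", "data-2": b"image bytes"}),
    DataRow(106, {"data-1": b"audio bytes"}),
]
```

Checking the restriction when a row is constructed keeps an invalid row from ever reaching the transfer step, which is one reasonable place to enforce it.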

Referring now to FIG. 60, shown is a more detailed 
diagram of the tables as used in a data transfer on a sending 
and receiving side using this data transfer technique. Shown 
in FIG. 60 is an example of a relational mapping table 1200 
which includes multiple advertisement pages. In this par- 
ticular embodiment, one tree-like structure is used to rep- 
resent one advertisement page. As shown in FIG. 60, two 
tree structures may be produced using the data described in 
the relational mapping table 1200. What will be described in 
paragraphs that follow is the data transfer of the advertise- 
ment page associated with the root node with the identifier 
104 which includes identifiers 104, 105 and 106 in its 
tree-like structure. 
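Collecting the identifiers that belong to one advertisement page amounts to walking the tree encoded by the (child, parent) pairs. The following Python sketch shows one way to do that; the sample pairs, the use of `None` as the parent of a root, and the function name `ad_page_ids` are assumptions, since the exact contents of the table in FIG. 60 are not reproduced in the text.

```python
from collections import defaultdict, deque


def ad_page_ids(mapping, root_id):
    """Return all record IDs in the tree rooted at `root_id`.

    `mapping` is an iterable of (child_id, parent_id) pairs, as in the
    relational mapping table. IDs are returned in breadth-first order.
    """
    children = defaultdict(list)
    for child_id, parent_id in mapping:
        children[parent_id].append(child_id)

    ordered = []
    queue = deque([root_id])
    while queue:
        node = queue.popleft()
        ordered.append(node)
        queue.extend(children[node])   # visit this node's children next
    return ordered


# Hypothetical table contents: two advertisement pages, rooted at 101 and 104.
mapping = [
    (101, None), (102, 101), (103, 101),
    (104, None), (105, 104), (106, 104),
]
print(ad_page_ids(mapping, 104))
```

With these sample pairs the walk for root 104 visits 104, 105, and 106, matching the identifiers the text names for that advertisement page.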

Referring now to FIG. 61, shown is the tree-like structure 
described by the relational mapping table 1200 for the 
advertisement page with the root node identifier 104 shown 
in FIG. 60.

Referring back to FIG. 60, on the receiver side of the data
transfer, shown are two tables, temporary table 1216, and ad
page table 1218. In this particular embodiment these two
tables are created on the receiver side for each advertisement 
transferred from the sender. In the snapshot of FIG. 60, the 
two tables of data on the receiver side depict tables after the 
transfer of the ad page with the root node of the identifier 
101 and prior to the transfer of the data associated with the 
advertisement page with the root node of identifier 104. A
separate ad page table 1218 is generally created on the receiver
side for each advertisement page. The temporary table 1216 is
filled with data during the data transfer and, after the data is
properly assembled on the receiver side, is not used until the
next data transfer operation. In this particular embodiment, 
the table ends in a state such that no data from the data 
transfer having just occurred is located in the table 1216. 
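The life cycle of the temporary table can be sketched as a small staging pattern. This Python illustration makes several assumptions: the class name `Receiver`, its method names, and the idea that "assembling" means copying the staged rows into a per-page table are hypothetical, since the source does not spell out the assembly step.

```python
class Receiver:
    """Receiver-side staging: one temporary table, one table per ad page."""

    def __init__(self):
        self.temporary_table = []   # plays the role of temporary table 1216
        self.ad_page_tables = {}    # root ID -> rows, like ad page table 1218

    def receive(self, child_id, parent_id):
        """Stage one mapping row while the transfer is in progress."""
        self.temporary_table.append((child_id, parent_id))

    def finish(self, root_id):
        """Assemble the staged rows into the page's table, then empty staging."""
        self.ad_page_tables[root_id] = list(self.temporary_table)
        self.temporary_table.clear()   # nothing from this transfer remains


receiver = Receiver()
for row in [(104, None), (105, 104), (106, 104)]:
    receiver.receive(*row)
receiver.finish(104)
print(receiver.ad_page_tables[104], receiver.temporary_table)
```

Clearing the staging table at the end of each transfer means a later transfer never sees leftover rows from an earlier advertisement page.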

Referring now to FIG. 62, shown is a block diagram of the 
data on the sender side and the receiver side as associated 
with the data table 1220 previously discussed in FIG. 59. In 
the example which will be described in paragraphs that 
follow involving the data transfer of identifiers 104-106, 
each identifier is associated with only blob data. It should be 
noted that this general technique and the data included in the 
data table 1220 may additionally include text data associated 
with each identifier or row in the table. An entry in the table 
1220 may also include only text data. As previously 
described in this embodiment, the limitation is that only one 
field entry of blob data may be associated with each row in 
table 1220. On the receiving side three tables are associated 
with transferring data which is blob data from the data table 
1220. These three tables include a blob temporary table 
1222, a blob table 1224, and a repository table 1226. It 
should generally be noted that any text data included in table 
1220 on the sender side may be transferred using the text data
transfer channel. What is described in FIG. 62 is that portion
of the data included in the data table 1220 which is blob data. 
In this example, only blob data is included in the advertise- 
ment page with the root node 104 which will be described. 

The blob temporary table 1222 is a temporary table used 
in the transfer of text information associated with blobs from 
the sending node to the receiving node. The blob table 1224,
in this particular embodiment, is an aggregate blob table
which includes entries for the blob data of multiple
advertisement pages. The snapshot of the data tables of FIG. 62
shows the data associated with one advertisement page, the one
with the root node identifier 101. After the transfer of the
advertisement page with the root node identifier 104 completes
on the receiving side, the blob table 1224 will also include the
information needed to retrieve the blob data associated with identifiers
104 through 106. It should be noted that the contents of the 
blob table 1224 do not include the actual blob data itself. 
Rather, as will be noted in the description that follows, the 
fields included in the blob table 1224 point to and further 
describe the actual blob data which is contained in the 
repository table 1226. The blob table 1224 in this embodi-
ment includes three fields for each entry associated with a
blob data entity: a sending record identifier 1228,
a size 1230, and a pointer 1232 to the actual blob data. The
sending record identifier 1228 identifies a particular blob
uniquely within a particular table or advertising page in this 
particular embodiment. Thus, an entry in the
record identifier column 1228 need not be unique across all of
the advertisement pages or data. Rather, the purpose of the
record identifier is to map or identify the particular blob 
pointer associated with a unique record identifier from the 
sending database. The size 1230 indicates the size in bytes 
of the blob described by the blob pointer field 1232. In other identifier, identical to that which is produced on the sending 

embodiments, the size field may include other units to side prior to sending the text data to the temporary blob table 

identify the size of the particular blob data. The blob pointer 1222. This information is passed or transferred to the 

field 1232 acts as an identifier or pointer into the repository external process 1240 which copies the actual blob data to 

1226 to uniquely identify within the repository a particular 5 the receiving side as well as the additional information 

piece of blob data. It should be noted that other embodi- described in temporary table 1222. 

ments or implementations may include additional fields in It should be noted that in this particular embodiment, the 

the blob table 1224 as well as in the repository 1226 in external process 1240 is a C++ program with library calls to 

accordance with other pieces of data that may be required in facilitate the transfer of data between the databases, 

order to enable the transfer to occur in a particular imple- 10 However, it should be noted that this is an external process 

mentation. with regard to the database. In other words, in this particular 

FIGS. 62 through 66 show the block diagrams of an embodiment the facilities used to transfer the data from the 

embodiment of transferring the data associated with an sending side to the receiving side are external with respect 

advertisement from the sending side to the receiving side. to the database. In this particular embodiment, "external" 

FIG. 63 depicts a snapshot of the tables associated with the is generally refers to the fact that the external process 1240 

text or Database Link transfer channel as included in the executes outside of the Oracle process space. Certain tasks 

sending and receiving sides. The data table 1200 on the must be performed by the external process in order to 

sending side has no modifications from the previously transfer the data from the sending side to the receiving side, 

described initial table as depicted in FIG. 60. However, the For example, the external process must connect to each of 

tables on the receiving side have been modified from those 20 the databases in order to access and transfer the data. This is 

previously described in FIG. 60. In particular, the temporary in contrast to the Database Link or text channel which is 

table 1216 serves as a temporary placeholder for the data internal to the database and no such connections are implied, 

involved in the data transfer of the particular ad page In other words, the routines which perform the data transfer 

described beginning with root node identifier 104. of the text are internal to the database and data copying, for 

Generally, the data associated with a particular advertise- 25 example, in this embodiment, is performed between remote 

ment page is extracted from the relational mapping table databases as if they were local copies. The precise way in 

1200 and is temporarily copied to and stored in the tempo- which both the text and blob data transfers are performed 

rary table 1216 on the receiving side. within other preferred embodiments may vary with imple- 

Shown in FIG. 64 are the tables associated with transfer- mentation and facilities available for communication and 

ring the actual data from the sending side to the receiving 30 data transfer. 

side. The data included in the data table 1220 is segregated It should also generally be noted that the external process 

into text data and non-text data. The text data is transferred may copy blob data from multiple tables in which the 

using the text channel. The non-text, multimedia data, or associated field name may differ with each table. Therefore, 

blob data, is transferred using an external process which the field name may also be included in table 1242. The 

creates a second multimedia data transfer channel in order to 35 external process uses this field name to retrieve blob data to 

send data from the sending side to the receiving side. In this be copied. Other embodiments may communicate this field 

particular embodiment of the data table 1220, the id and the name using other mechanisms. 

size fields are copied to the blob temporary table 1222. The external process 1240 uses the data included in the 

Additionally, a global id (Gid) is generated on the sending temporary table 1242 to fetch or access the blob data 

side prior to transmitting these fields to the receiving side. 40 associated with a particular table name and field name to 

This global id is transferred to the receiving side and subsequently index into each particular table name using the 

included in each associated entry of the temporary table identifier to extract the actual blob data. This blob data is 

1222. Generally, the Gid is a unique identifier associated copied to the repository table 1226 on the receiving node by 

with each record uniquely identifying the record among all process 1240. In FIG. 64, the repository table 1226 includes 

tables associated with database information. 45 the blob data associated with advertisement identifier 104. 

The blob data from table 1202 and the associated infor- This data is appended to already existing data in the reposi- 

mation in table 1242 are transferred to an external process tory 1226. 

1240 located on the sending side. In this particular It should generally be noted that the transfer of the text 

embodiment, an Oracle™ pipe is the communication means data through a first data channel and the transfer of the blob 

used to transfer the data from the data table 1220 to the so data through an alternate or second multi-media data chan- 

external process 1240. The external process 1240 further nel are performed asynchronously. When the receiving side 

transmits the data via a multimedia data channel to the has determined that all of the necessary data entities asso- 

receiving side. Table 1242 may also be viewed as a tempo- dated with a particular table or advertisement have been 

rary table which serves as a placeholder for that data which transferred successfully to the receiving side, the process of 

is transferred by the external process 1240 to the receiving 55 assembling the data into the advertisement page begins. It 

side. Located in temporary table 1242 are four pieces of should also generally be noted that the data described in 

information including a table name, a field name, an tables 1224 and 1226 are functionally equivalent to the data 

identifier, and a global identifier associated with each blob stored in table 1220. For example, table 1224 includes a blob 

data entity. The table name generally describes or identifies pointer field which acts as an index into the repository table 

the particular table within which a piece of blob data is 60 1226, whereas table 1220 includes the actual blob data in a 

located or associated. In this particular embodiment, each field. Thus, the use of the blob pointer field in table 1224 

table is associated with a particular advertisement or adver- which acts as an index into the repository table 1226 

tisement name. The field name identifies the type of non-text performs the same function as the actual data in the blob data 

data. In this particular embodiment the field name is "Blob" field of the data table 1220. 

referring to blob or multi-media data. The identifier field (Id) 65 What will be described in conjunction with FIGS. 65 and 

of table 1242 is the unique record identifier copied from 66 is the integration process of the tables of the text and the 

table 1220. The global identifier (Gid) is a unique global blob data for the advertisement page identified by the 
sending identifier 104. Referring now to FIG. 65, shown is a block diagram of an embodiment of the tables resulting from the text data integration. In particular, table 1200 on the sending side remains the same as in previously described figures. On the receiving side, table 1216 data has been integrated and copied into the table 1218. The function of temporary table 1216 is generally to hold that text data associated with the relational mapping table which is transferred from the sending side to the receiving side until all of the data entities associated with the particular advertising page or table being transferred have arrived on the receiving side. At this point, the data integration on the receiving node begins. The software on the receiving side performs a state integration process. The previously described task of integrating the data from temporary table 1216 into table 1218 is one such task performed by this integration software.

Referring now to FIG. 66, shown is a block diagram of an embodiment of the data table 1220 whose contents have been transferred to the receiving side. The assembling software on the receiver side integrates the data from temporary table 1222 into table 1224. Additionally, a link is established in table 1224 to the data in table 1226 and the associated global identifier removed. Each entry in table 1222 is copied into table 1224. In particular, the Id and Size fields are copied into table 1224 for identifiers 104, 105, and 106. The integration software then uses the global Id obtained from temporary table 1222 to index into the repository 1226 in search for a matching global identifier entry. When a matching global identifier is found in table 1226, the repository Id from table 1226 is copied into the blob pointer field (Blob Ptr) of table 1224. Subsequently, the global Id in table 1226 for the corresponding entry is reinitialized to an empty field. The resulting table 1226 shows this process as repeated for each entry in the previously described table 1222 from FIG. 64.

Referring now to FIG. 67, shown are method steps of one embodiment for assembling the blob data into the repository table. The steps described in FIG. 67 generalize the method previously described in conjunction with FIGS. 64 and 66 wherein the data shown in FIG. 64 is integrated and assembled into the tables on the receiving side resulting in those as displayed in FIG. 66. Generally, at step 1250, the record identifier and table size are copied from the temporary blob table to the blob table. At step 1254, the global identifier from the temporary blob table is used as an index into the repository table to find a matching global identifier. For this matching entry, as in step 1256, the repository identifier is copied from the repository table to the blob pointer field of the blob table. At step 1258, the global identifier field of the repository table is reinitialized. The end result of performing the steps described in FIG. 67 is the tables as displayed in FIG. 66, representing the integrated or assembled blob table in which the blob data is integrated into the repository table 1226 as further described by the blob table 1224. It should generally be noted that the files resulting from the copying of the text and the blob data as described in FIGS. 65 and 66 have a particular relationship. Generally, the sending and receiving side for the text data have mirrored files. In this particular example, table 1200 and table 1218 are "mirror images" of each other. The temporary table 1216 is used in performing the transfer as a temporary table until all of the data for this particular data transfer has arrived on the receiving side. At that point, the data is integrated from the temporary table into the final resulting table 1218, so that table 1218 on the receiving side mirrors table 1200 on the sending side. Regarding the multi-media or blob data on the sending side and the receiving side, the resulting tables 1224 and 1226, in combination, are functionally equivalent to the data described in the sending side in table 1220. In this particular embodiment, one of the reasons for not further merging the data of tables 1224 and 1226 is due to the fact that transferring blob data, including a copy of the blob data from table 1226 to be integrated into table 1224, requires the use of an external program in order to compress the tables further. This is due to the fact that in order to perform any transfer of data which is not text, an external program, similar to external program 1240, is generally used since a version of the database software, as in this embodiment, may not be capable of copying and directly manipulating non-text data as needed in performing data operations.

The tables which are described in the preceding figures and associated descriptions may have a different number of entries and fields particular to each implementation of the concepts which have been described herein. What has been described is a flexible and efficient technique for performing data transfers. In this particular embodiment, the data transfer is between two databases. The techniques described may be adapted and used within other applications and a variety of environments.

The overall technique is generally to copy the text and blob or multi-media data asynchronously on two separate channels. This data is copied from a first database to a second database. Initially, the data is located on the second database in a temporary location until all of the portions of the data associated with a particular data transfer arrive at the second database. When it has been determined that all portions of the data have successfully arrived on the second database, the assembly process of copying the data from the temporary locations and merging the information into other data tables is performed on the second database.

Generally, the foregoing technique for data transfer may be used in a variety of applications, such as for the data transfer between databases. In one embodiment, this technique is included in a system for online Interactive Yellow Pages, GTE Superpages, for the publication of multimedia advertisement content of GTE Superpages business customers. Generally, the GTE Superpages system includes two major components: (1) the server component which serves versatile user requests for the information of more than 11 million businesses in the United States and (2) the BackOffice component that facilitates advertisement content creation, management and publication. Both these subsystems include databases where advertisement business information is persistently stored. The advertisement content produced or modified in the back office is published in the Superpages by virtue of its transfer from the persistent storage in the back office to the persistent storage in the server. Generally, the business advertisement includes an integrated set of structured textual information, such as business name and address, and multimedia or blob data, such as graphics, video, audio, and Java applets.

The data transfer technique described is generally a technique for transferring data using two data links between two databases. One of these data links is an internal data link with respect to the database; the second data link is an external data link with respect to the database. The internal data link is optimized for the structured text data transfer while the external one is optimized for the multimedia data transfer, such as the transference of data stored in binary objects in the database. This technique for data transfer generally alleviates the limitations of the existing database technology which does not provide for the transferring of
multimedia objects using the internal data link. Moreover, by using the two data links to transfer the various data types, performance and stability are improved over an alternative prior art approach which uses only the external link for transferring both text and multimedia or blob data.

Generally, the transfer technique includes four collaborative processes: a process on a sending component which decomposes data structures and the like into text and non-text components, assigning transient tags to the non-text components; two asynchronous transfer processes, one per data type, that each transfer, respectively, text and non-text components to a receiving component; and a process on the receiving component that reassembles transferred data and replaces transient tags with persistent unique tags.

This technique uses a multimedia data repository table which is created and maintained in the receiving component, such as the receiving database in this embodiment. Once the data is transferred, the non-text or multimedia data items are stored in this repository with transient tags. Using the transient tags, the reassembly process correlates the text tables with the multimedia objects and replaces them with persistent unique tags, thus leading to the reintegration of the transferred data.

The previously described technique includes features which provide for efficient decomposition and reassembly of data for efficient data transfer, as between two databases. Additionally, the multimedia repository serves as a vehicle for the reassembly of decomposed data items which are reassembled on a receiving component, such as a receiving database.

[Incremental Update]

In paragraphs that follow, a description is provided of an incremental update procedure as performed upon the various databases included in the Front End Server component 804. The data in the BackOffice component 818 may be updated, for example, on a daily basis. These deltas or changes to this database in the BackOffice component are subsequently also applied to the copy of the database in the Front End Server component. It should generally be noted that in this application, as in the GTE Superpages online system, the number of transactions or updates to a database ranges from 30,000 to a half a million on a daily basis in accordance with the required data updates for the existing database. However, the techniques which will be described in paragraphs that follow may be applied to different systems with different transaction throughput and tuned in accordance with each particular implementation.

Generally, this update technique is used to provide data updates for both native and foreign sources, and on-line updates, as described in accordance with data processing techniques in other sections of this application.

Generally, data updates to the databases included in the Front End Server may first be integrated into the BackOffice component. Subsequently, these data modifications may be "pushed" to the Front End Server and integrated into the various data stores included therein, as will be described in more detail in following sections. Generally, in this embodiment, data updates may originate from several sources, including native and foreign source updates, and on-line data entry, such as through an Internet connection via a browser. The native and foreign source updates may generally be characterized as larger updates or data integration efforts. These are generally described in other sections of this application. The on-line data entry technique for updating information that may be included in the BackOffice component may be performed as previously described through the menus initially displayed to a user, such as at the GTE Superpages Internet site, that provide access to the BackOffice component data information.

The data integration techniques, as related to the foreign and native source updates to integrate the data updates into the BackOffice component, are generally more detailed and involved than the integration of the on-line specified modifications. In the former case, the data updates may generally be a large number of data modifications requiring more computer resources than in the latter case. Thus, for example, the on-line modifications may be incorporated on a daily or other predetermined time period using some data enhancement techniques as described in other sections of this application. Other data updates may require more time and resources and may be completed, for example, during nonpeak usage, such as overnight on a daily basis. Thus, additional planning and different processing techniques may be used with the various types and volume of data updates as included in each embodiment.

Once the data modifications are incorporated into the BackOffice component, the data updates, including the updates to advertisement data and other data associated with each business listing, may be propagated to the Front End Server component. The non-text or multimedia data, for example, as included in advertisements with image files, may be transferred to the Front End Server from the BackOffice using multimedia transfer techniques, as generally described in other sections of this description. The updates to the Primary Database included in the Front End Server may be communicated as a table of commands created in the BackOffice component and transferred, as by a network connection, to the Front End Server. Generally, in this embodiment, the table created in the BackOffice includes an application-developed command language corresponding to the various types of record updates and modifications that may be included in this particular embodiment. Each of these commands may be further translated in the Front End Server into one or more actual database commands that perform the table operation. For example, an entry in the table of database update commands may be specified as follows:

COMMAND    RECORD #    OPTIONAL DATA
DELETE     1-5

In this above example table, three fields of data may be included. A Command field specifies the type of data command. The Record # field identifies the records in the Primary Database to which this command applies. The Optional Data includes data that may be related to the specified command. For example, if the command were update, the data field may specify the data which is to be included in the records specified. In the above example, the command is to delete records 1-5. This single table command may be translated, for example, by software included in the Primary Database, into 5 database commands in accordance with the particular database software. The software which builds the table in the BackOffice and translates the commands into one or more database commands may be developed using a commercially available software system that is capable of communicating with the underlying database to perform the required operations.

It should be noted also that the entire table may be transferred from the BackOffice to the Front End Server, or it may be divided into sections and updates performed for each section. Additionally, each command may be sent as a separate message in other embodiments in accordance with the number of updates and other associated computer resources and costs for each data transaction. This may vary with implementation.
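The translation of a single command-table entry into several database commands, as in the DELETE 1-5 example above, can be illustrated with a minimal sketch. The function name `expand_command`, the table name `listings`, the column names `id` and `data`, and the SQL text are all hypothetical; the description does not specify the command language syntax or the statements issued to the database.

```python
def expand_command(command, record_spec, optional_data=None):
    """Expand one command-table entry into per-record statements.

    Illustrative only: the statement text and the `listings` table
    are assumptions, not part of the described embodiment.
    """
    # Parse a record specification such as "1-5" (a range) or "7" (one record).
    if "-" in record_spec:
        first, last = (int(part) for part in record_spec.split("-", 1))
    else:
        first = last = int(record_spec)

    statements = []
    for record_id in range(first, last + 1):
        if command == "DELETE":
            statements.append(("DELETE FROM listings WHERE id = ?", (record_id,)))
        elif command == "UPDATE":
            statements.append(
                ("UPDATE listings SET data = ? WHERE id = ?", (optional_data, record_id))
            )
        else:
            raise ValueError(f"unsupported command: {command}")
    return statements


# The single entry "DELETE 1-5" becomes five per-record commands.
print(len(expand_command("DELETE", "1-5")))  # prints 5
```

Each returned pair is a parameterized statement and its arguments, so the caller can choose whether to execute the whole list at once, in sections, or one message per command, in line with the alternatives noted above.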
Referring to FIG. 31, shown is an embodiment of a dependency graph for performing the various processes in an incremental update. At step 1600, the BackOffice data transfer must complete prior to beginning the update to the database in the Front End Server component. The BackOffice data transfer is complete when multimedia and text data have been transferred from the BackOffice component, such as data required when updating an advertisement page. Additionally, other information from the BackOffice component is transferred to the Front End Server component 804, such as in the form of an operational table. The operational table may include information about the updated normalized data, which has been applied to the BackOffice component, and which is now to be applied in this incremental update procedure to the Primary Database copy of the normalized data.

At step 1602, an initialization procedure may be executed to synchronize the beginning of the update procedure for the steps that will be described in paragraphs that follow. As indicated by FIG. 31, steps 1604, 1606, and 1608 may be performed independently and at the same time as steps 1610 through 1620. The coordinating point labeled DB Prep at step 1622 serves as the coordinating point for the different procedures performed in updating the database on the Primary Database, and the local copies of necessary files, such as the Term list identifiers, located on each of the server nodes.

At step 1604, the various advertisements are extracted from the data tables, such as those transferred from the BackOffice component in the multimedia and text data transfer. At step 1606, the various advertisement pieces are packaged and made into a complete advertisement page to be stored in the Constructed Ad Repository 842. At step 1608, the constructed ads are transferred and included in the Constructed Ad Repository. It should be noted that in this embodiment the existing copy of the Constructed Ad Repository is updated in accordance with those particular ads which have changed. Thus, the Constructed Ad Repository is updated on a delta or change basis.

Simultaneously, steps 1610 through 1620 may be performed in conjunction with steps 1604 through 1608. This may be done, for example, in a parallel fashion. Steps 1610 through 1620 indicate that process by which the various identifiers and other files associated with the Primary and Secondary database are updated. Steps 1604 through 1608 reflect the updating of the Constructed Ad Repository 842 on an as-needed basis in accordance with changes which have occurred in the advertisements.

At step 1610, various changes to the Term lists identifiers are extracted. In other words, it is determined at step 1610 what identifiers in the Term lists need to be updated in accordance with the changes transferred from the BackOffice component. This is described in more detail in paragraphs that follow. At step 1612, these various identifier updates are packaged. At step 1614, these various identifier changes are transferred to each of the server nodes. In this embodiment, the actual data transferred at step 1614 are the raw operational commands as may be supplied by the BackOffice component to be applied to the existing Term lists. At step 1616, at each node, a working copy is made of the existing Term lists. At step 1618, on each of the server nodes, the changes are made to the working copy local to each server node. At step 1620, the updated term list is installed. At this point, the updated term list is not yet available for public use in the sense of being published. However, a new version of the Term lists has been created which includes the updated information as supplied in the transfer step 1614.

At step 1622, database preparation steps are performed. Step 1622 serves several purposes. One is a coordination point for the updates of the various ads, as well as the various term list identifiers. Secondly, step 1622 serves as a step within which the normalized Primary Database information is propagated from the normalized copy of the Primary Database to a denormalized form in the Primary Database and the denormalized form in the Secondary Database. In other words, the changes which are transmitted from the BackOffice component and reflected in the normalized Primary Database copy are now further propagated to the denormalized Primary Database and the denormalized Secondary Database. As part of the database preparation, the validity of the transactions and updates is verified such that at step 1626 the database may fully commit to performing the update to the denormalized copies as used in performing user queries.

Steps 1624 and 1630 and, respectively, step 1626 may be performed in parallel. After the database preparation of step 1622, the ads may actually be published as in step 1624 in which the updated copies of the Constructed Ad Repository are actually made available for use. Additionally, any updated images as stored in the Image Repository are also available for use. At step 1630, the previously installed identifiers included in the Term lists, as installed in step 1620, are published. At step 1630, the publication of the various identifiers included in the Term lists generally means that the Term lists are available for use, as by the Query Engine. At step 1626, which may be performed in parallel with the steps of publishing the ads and publishing identifiers, the database commits to performing the update.

It should generally be noted that steps 1614 through 1620 are performed independently for each server node in this embodiment. Additionally, the actual amount of processing performed on the Term lists varies in accordance with the number of updates or transactions, as will be described in conjunction with FIG. 32.

Referring now to FIG. 32, shown is one embodiment of the various method steps for performing update steps in accordance with a particular number of update transactions as sent from the BackOffice component 818. At step 1634, a determination is made as to the number of update transactions. This determination involves a comparison with two threshold values each describing a particular threshold number of transactions. Generally, THRESHOLD 1 describes a relatively small number of transactions. In this particular embodiment, a relatively small number of updates generally refers to less than 30,000 update transactions. Also specified is a THRESHOLD 2 value which generally represents a second, larger number of transactions. In this particular embodiment, THRESHOLD 2 represents approximately half a million transactions or update entries which corresponds to approximately five to ten percent of the number of records included in the Primary Database. Generally, as described in conjunction with FIG. 32, one of three update techniques may be applied. If the number of update transactions as determined at step 1634 is less than THRESHOLD 1, or a relatively small number of updates, steps 1636 and 1638 are executed. In step 1636, the normalized Primary Database is updated. Generally, this is performed at step 1602 of FIG. 31 in which the copy of the normalized Primary Database is updated in accordance with the operational table as transferred from the BackOffice component indicating the actual database update operations. At step 1638, due to a relatively small number of transactions required, the actual identifiers of the Term lists are updated. In other words, the Term lists are updated as opposed to being rebuilt.
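The comparison made at step 1634 against the two thresholds can be sketched as a three-way selection. This is a minimal illustration, assuming the threshold values quoted for this embodiment (30,000 and approximately half a million); the function name and the returned labels are hypothetical, and the step numbers in the comments refer to FIG. 32.

```python
THRESHOLD_1 = 30_000    # "relatively small" number of updates in this embodiment
THRESHOLD_2 = 500_000   # approximately half a million update transactions


def choose_update_technique(num_transactions,
                            threshold_1=THRESHOLD_1,
                            threshold_2=THRESHOLD_2):
    """Pick one of three update techniques from the transaction count.

    The labels returned here are illustrative names, not terms
    used in the description.
    """
    if num_transactions < threshold_1:
        # Steps 1636 and 1638: update the normalized Primary Database
        # and update the Term list identifiers in place.
        return "update_term_lists"
    if num_transactions < threshold_2:
        # Steps 1640 and 1642: update the normalized Primary Database
        # and rebuild the Term lists.
        return "rebuild_term_lists"
    # Step 1644: rebuild the entire database and associated files.
    return "rebuild_database"


print(choose_update_technique(12_000))
print(choose_update_technique(200_000))
print(choose_update_technique(750_000))
```

Passing the thresholds as parameters reflects the remark that the technique may be tuned for systems with different transaction throughput.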
At step 1634, if a determination is made that the number of transactions is greater than or equal to THRESHOLD 1, and also less than the greater threshold, THRESHOLD 2, steps 1640 and 1642 are executed. At step 1640, the Primary Database is updated, as previously described in conjunction with step 1602 in which the normalized copy of the Primary Database is updated. At step 1642, all of the identifiers as included in the Term lists are rebuilt. In this particular embodiment, both identifiers and markup files are rebuilt due to the use of the mark-up files by the Verity Information Retrieval software. As previously described in conjunction with FIG. 25, the Extraction Routines are executed to again produce the markup language files and various update records needed to update the denormalized data of the Primary Database. In step 1642, the Information Retrieval software is executed to produce entire new sets of the Term lists. Step 1642 is in contrast to step 1638. Rather than rebuild the Term lists as in step 1642, the Term lists are updated in step 1638.

If a determination is made at step 1634 that the number of update transactions is greater than or equal to the larger threshold, THRESHOLD 2, step 1644 is executed. At this point, a determination has been made that the number of update transactions is so large that it has been deemed more efficient to rebuild the entire database and associated files, rather than update or patch the existing database and associated files, as in updating the identifiers of the Term lists of step 1638.

In this particular embodiment, as may be included in the BackOffice component, the various updates to a particular record or for a particular business or service may be collapsed before actually issuing the various database commands to perform the updates. In other words, within a small amount of time, such as within five hours, a record may be inserted, deleted and modified dozens of times. The end result of these modifications over the small time period may be no net modification or amendment to the particular record. Thus, one optimization, as may be included in the BackOffice component in a preferred embodiment, may collapse various updates associated with a particular record or business before actually issuing commands which perform a database update as applied to the copies in the BackOffice 818 and Front End Server 804 components. Generally, this may be determined by using a finite state machine with the states of "insert", "delete", and "modify". If the same record, for example, is modified twice and then deleted, the net result is that only a "delete" database command should be issued rather than issue two updates followed by a delete.

Also, in this particular embodiment, the contents of the Page Cache 848 and the Query Cache 850 are reinitialized when an update is performed, as in performing the incremental update procedures described in conjunction with FIGS. 31 and 32. The data included in the PHTML execution tree is also reinitialized.

The previously described procedure of performing a mul- A failure may occur when performing any of the steps 

timedia data transfer is used to transfer, for example, the associated with FIGS. 31 and 32. If a failure occurs when 

multimedia and text data associated with ads, as may be 30 performing certain steps, then a recovery procedure may be 

included in the Constructed Ad Repository 642 and Image performed. In this particular embodiment, a failure may 

Repository 842 of FIG. 4. The granularity which indicates occur for example, when using the Information Retrieval 

that an advertisement page has changed requiring the entire software, as depicted in conjunction with FIG. 25. This may 

advertisement page to be replaced in the Constructed Ad be due, for example, to a problem, such as a software bug, 

Repository is if a single component within an ad page has 35 with the Information Retrieval software 908. For example, 

changed. In this case, the entire ad page is reconstructed and an error may occur when extracting the identifiers associated 

replaced in the Constructed Ad Repository 842. For other with step 1610. Generally, step 1610 as previously described 

systems, a different granularity of change may be used. includes building the Term fists as determined in accordance 

Generally, as previously described, the various markup files with the number of update transactions in accordance with 

and Term lists are built as needed in accordance with the 40 FIG. 32. If an error occurs, for example, when producing or 

number of transactions as described in conjunction with rebuilding the identifiers in the Term lists as in performing 

FIG. 32. The actual threshold values may be determined in step 1642 and step 1644, it may be a recoverable error if 

accordance with tuning of a particular system and the size of another node has successfully built the identifier files, for 

the database the number of transactions in each particular example. In this instance, where there has been a successful 

system. In this particular embodiment, the database as 45 build of the various identifiers on another server node, a 

included in both the Front End Server and the BackOffice recovery procedure may be to copy the updated version of 

component are Oracle™ databases. The Oracle™ proce- me Term lists from one node to another node which has been 

dural language, PL/SQL, may be used to read the opera- unsuccessful in the building the Term lists. This copy may 

tional table and perform the updates as needed to the occur, for example, after a predetermined number of builds 

normalized form of the data as stored in the Primary 50 of the Term lists on a particular node have failed. In this 

Database included in the Front End Server component. particular embodiment, this has been determined to be a 

Similarly, the same procedural language in files may also recoverable error with which an alternative step or technique 

used to update the denormalized Primary Database copy and may be applied to also achieve the end result of the updated 

the denormalized form of the data as stored in the Secondary Term lists. Other embodiments of the invention may also 

Database. Other embodiments may employ other techniques 55 include other alternative techniques in accordance with 

to update both the Primary and Secondary databases in those steps associated with a particular system which it 

accordance with a particular implementation. determines to be recoverable. 

In this particular embodiment, the previously described In the previously described embodiment, the update tech- 
incremental update procedure is one that is generally used to niques may be included in a distributed computing system 
perform daily updates. However, in other embodiments, the 60 having multiple data representations as stored in a plurality 
same procedure may be used on a larger time period of of server nodes. The foregoing techniques provide for syn- 
transactions or updates. Due to the volume and size of the chronized updates of the various data stores in the plurality 
previously described embodiment, this procedure is one of server nodes, 
which performs well when performed on a daily basis. For [Targeted Banner Advertisements] 

other systems which may perform a similar number of 65 User query information may be used to influence the 

transactions for a larger time period, the previously displays shown to the user by the browser 824. In addition 

described techniques may also be used. to displaying matching categories or business listings, as 
depicted in FIG. 44, the information retrieval software 908 can be used to assist in selecting other information to be displayed to the user, based on the nature of the user's query.

In an embodiment of the invention, a banner ad 50 can be displayed to the user. Based on the user's query, the banner ad 50 may be targeted to characteristics of the user that are inferred from the user's query. For example, an advertiser might conclude that a user who has entered a query with the category "art supplies" is interested in art, so that an advertisement for an art show or related matter would be an appropriate banner ad 50. Banner ads 50 can also be targeted geographically, so that ads for businesses from a selected geographical area can be associated with search queries that include that geographical area as a search term. It should be understood that a system for targeting banner ads using user queries can use a range of information retrieval techniques, such as the Verity techniques described above in connection with processing of information retrieval requests using the term lists 836. However, in an embodiment, a separate banner ad retrieval program 909 is part of the query engine 862.

Initialization steps that permit execution of a banner ad retrieval program 909 are set forth in a flow chart 52 in FIG. 68. Upon initialization, at a step 54, the system initiates the banner ad retrieval software 909. At a step 56, the banner ad retrieval software 909, in a manner similar to the information retrieval software 908, uses extraction routines to access markup language files and extract data. The banner ad retrieval software then generates banner ad term lists 837. At a step 66, the banner ad retrieval software retrieves a list of all yellow pages categories. In an embodiment, the categories are all of the available categories of business listings, such as all available yellow pages categories. Next, at a step 68, the system establishes a set of super-categories. The super-categories may consist of a sub-set of the categories, or other categories. The super-categories are preferably smaller in number than the categories, as the super-categories will be used to simplify assignment of targeted banner ads to particular user queries and results of the queries. Next, the system may map categories to super-categories in a step 70. The mapping at the step 70 may be a many-to-many mapping. A variety of techniques may be used to map categories to super-categories. One such technique uses a combination of automatic and manual mapping.

Steps for accomplishing such a technique are set forth in a flow chart 73 depicted in FIG. 69. First, at a step 104, it is determined for a first yellow pages category whether the category is to be manually assigned. If so, then at a step 106 the category is assigned to a super-category. This may be accomplished by user input in a conventional form. Next, at a step 108, it is determined whether any unassigned categories remain. If at the step 108 additional categories remain, then control returns to the step 104, where it is determined whether the next category is to be manually assigned. If at the step 108 no categories remain to be assigned, then control is returned, as represented by off-page connector B, to the flow chart 52 of FIG. 68.

If at the step 104 it is determined that the category will not be assigned manually, then it is determined, at a step 110, whether there remain any additional categories to be assigned. If so, then at a step 112, the category is skipped and processing proceeds to the next category at the step 104. Thus, all categories that are to be assigned manually may be assigned prior to automatic assignment of categories.

If at the step 110 it is determined that no additional categories exist, then all categories to be assigned manually have been assigned, and control proceeds to a step 114, where the system returns to the first category that was not manually assigned, and it is determined whether the category will be assigned automatically based on the manual assignments. If at the step 114 it is determined that the category will be assigned automatically based on the manual assignments, then, at a step 116, the system may compare terms that appear in the category to terms that appear in each of the manually assigned categories. The system may thus obtain a ranking of the manually assigned categories in order of the degree of co-occurrence of terms. Next, at a step 118, the system may assign the same super-category as was assigned to the highest-ranked of the manually assigned categories. Next, at a step 120, the system may determine whether there are any additional categories. If not, then control passes, as depicted by off-page connector B, to the flow chart 52 of FIG. 68. If additional categories remain, then control proceeds to the step 114 for the next category.

If at the step 114 for a particular category it is determined that a category will not be automatically assigned based on the manual assignments, then at a step 122 a determination is made whether additional categories remain to be assigned. If so, then at a step 124 processing skips to the next category and control is returned to the step 114 for the next category. Thus, after manual assignment of all categories that are to be manually assigned is complete at the steps 104 through 106, then all categories that are to be automatically assigned based on the manual assignments may be completed at the steps 115 through 118 before control proceeds to the step 126.

At the step 126, processing returns to the first remaining category that was not previously assigned. At a step 128 the system may determine certain statistics regarding the co-occurrence of terms between the category and one of the super-categories (perhaps also including the terms in the categories assigned to the super-categories). A variety of co-occurrence techniques can be used. At a step 130 the system may assign the category to the super-category for which the highest co-occurrence is found. At a step 132 it is determined whether additional categories remain to be assigned. If not, then control proceeds, represented by off-page connector B, to the flow chart 52 of FIG. 68. If so, then control proceeds to the step 126 for processing of the next unassigned category. Although an embodiment of a technique for mapping categories to super-categories is disclosed herein, it should be understood that other techniques are available. For example, manual mapping could be executed after all automatic mapping is completed, or the system could rely entirely on automatic mapping.

Once control has returned to the flow chart 52 of FIG. 68, meaning that all yellow pages categories have been mapped to a super-category, at a step 77 the banner ad retrieval software 909 may index the various super-categories in a banner ad term list 837. The banner ad term list 837 may take the form of a linked list of the super-categories, with each element in the list consisting of all of the terms that appear in the super-category, as well as all of the terms that appear in each of the categories that was matched to the super-category. It should be understood that these terms may be expanded, as described in connection with FIG. 40 above, so that synonyms and related terms are also stored with each super-category element. Storage of these terms may be in a hierarchical structure that is capable of execution using PHTML scripts or similar techniques.

Next, at a step 72 the system may match one or more banner advertisements to each super-category. Thus, if that super-category is found to be the appropriate super-category, the matching banner ad or ads will be displayed.
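The combined manual and automatic mapping of FIG. 69 can be sketched as follows. This is a minimal illustration only, not the disclosed implementation: the function names, the word-set tokenizer, and the use of a simple shared-word count as the co-occurrence measure are all assumptions, as is the rule that a category inherits from a manually assigned category only when the two share at least one term. For brevity, the final pass compares against super-category names only.

```python
def terms(name):
    """Lowercased word set for a category name (hypothetical tokenizer)."""
    return set(name.lower().replace("&", " ").split())


def overlap(a, b):
    """Co-occurrence score: number of shared terms (one of many possible measures)."""
    return len(terms(a) & terms(b))


def map_categories(categories, manual, super_categories):
    """Return {category: super-category}.

    manual holds the hand-made assignments (steps 104-106). Each remaining
    category first inherits the super-category of the most similar manually
    assigned category (steps 114-118). Anything still unassigned is matched
    directly against the super-categories (steps 126-130).
    """
    mapping = dict(manual)

    # Pass 1: inherit from the highest-ranked manually assigned category.
    for cat in categories:
        if cat in mapping:
            continue
        best = max(manual, key=lambda m: overlap(cat, m), default=None)
        if best is not None and overlap(cat, best) > 0:  # assumed criterion
            mapping[cat] = manual[best]

    # Pass 2: assign what is left to the super-category with highest co-occurrence.
    for cat in categories:
        if cat not in mapping:
            mapping[cat] = max(super_categories, key=lambda s: overlap(cat, s))

    return mapping


result = map_categories(
    ["Arts & Crafts", "Crafts Supplies", "Pizza Restaurants"],
    {"Arts & Crafts": "Art"},
    ["Art", "Restaurants"],
)
```

Here "Crafts Supplies" shares the term "crafts" with the manually assigned "Arts & Crafts" and so inherits "Art", while "Pizza Restaurants" shares nothing with any manual assignment and falls through to the direct comparison with the super-categories.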
At any time after initialization of the system, the system may generate a banner ad for display to the user. The banner ads may be stored on a server, which in an embodiment is a separate banner ad server 809. Depending on the desires of the host, the banner ads may be either conventional banner ads or targeted banner ads. In the case of conventional banner ads, the banner ad server 809 may store the banner ads in a conventional manner and cycle between different ads according to a predetermined routine, such as a round-robin routine, so that when the system calls for a banner ad (such as via an appropriate URL for the banner ad server), the current banner ad is sent to the front end server 804 for further processing and display to the user in a banner on the user's browser 824.

If a targeted banner ad is desired, then the banner ad retrieval software 909 may be initiated. Steps that may be accomplished by an embodiment of the banner ad retrieval software 909 are depicted in a flow chart 132 as shown in FIG. 70. First, at a step 60, the banner ad retrieval software 909 obtains the user's query. Next, at a step 62, the banner ad retrieval software obtains the categories that match the user's query. These categories may be the categories that are obtained by the information retrieval software 909 in response to a user query. For example, if the user enters a query for "art supplies," as depicted in FIG. 43, the user might retrieve a list of matching categories, such as the eight matching categories depicted in FIG. 44. In an embodiment, the categories are those that were displayed as a results page in the flow chart 88 at the step 102 in FIG. 41. That is, the categories are yellow pages categories of each of the business listings retrieved in the information retrieval query that was executed by the system.

Once a list of categories is obtained at the step 62, a variety of techniques could in theory be used to identify a banner ad for the category. For example, an advertisement could be assigned to each category. Thus, referring to FIG. 44, the category "Arts & Crafts" could be assigned a particular banner ad (or set of scrolling banner ads), while the category "Artists Materials & Supplies" could be assigned a different banner ad or ads. This approach presents a number of problems. First, the number of actual yellow pages categories is very large, more than seventeen thousand in an embodiment of the system disclosed herein, so that the process of assigning ads to categories on a one-to-one basis would be extremely time consuming and laborious. Also, because advertisements often include time-sensitive material, they are changed frequently, meaning that the ongoing process of assigning ads to categories could be very difficult. Since many of the categories are quite similar to each other, as in the above example of "Arts & Crafts" and "Artists Materials & Supplies," it is instead preferable to assign ads to super-categories, as was disclosed in connection with FIG. 68.

Another problem with an approach of matching advertisements directly to categories is that additional information about the user's preferences may be available from the user query. A system that relies only on the categories ignores any information from the user query that might permit further refinement of the advertisement selection.

Referring to FIG. 70, once the banner ad retrieval software 909 has obtained the terms in the user query and the terms in each of the matching categories, the terms may be weighted or normalized by the number of occurrences of the terms and the number of listings in which a term occurs in a step 74.

Next, at a step 79, the banner ad retrieval software 909 may locate the particular terms that appear in the user query and in the categories obtained at the steps 60 and 62 in the banner ad term lists 837. Location of a relevant term list 837 may be accomplished through use of a table of pointers or other conventional technique. In the case of use of a table, the argument of the table may consist of a tokenized version of the term, and the table may point to the location of the linked term list 837 for that term in the database that stores the banner ad term lists 837.

Referring to FIG. 71, a structure for a linked banner ad term list 837 is depicted, in which a linked list of super-categories is shown. One linked list may be established for each term that appears in a user's query or in a category, such as a yellow pages category, retrieved by the information retrieval software 909. Thus, for a given term, such as "restaurant," a linked list 837 of super-categories was established at the initialization step 77 depicted in the flow chart 52 of FIG. 68. The linked list may link elements 74, with each element 74 corresponding to a document (a document in this case consisting of all of the words in a particular super-category, plus all words in the categories mapped to the super-category) that includes the term. The elements 74 may include sub-elements, including a document identifier 76 for identifying the category and certain statistics regarding the document, including the term frequency 78, TF, which indicates the number of times the term appears in the document, and the inverse document frequency 80, IDF, which indicates the inverse of the number of times the term appears in the entire set of documents that are being searched.

From the table of linked lists of super-category terms established at the step 77, the banner ad retrieval software 909 may at a step 81 rank the super-categories. In particular, the system at the step 81 may rank the documents, i.e., the super-categories, according to the appearance of the words occurring in the user query and in the categories. The ranking may be performed by a variety of techniques. One such technique obtains a number for each term that appears in the user query and in the categories that consists of the product of term frequency for that term and the inverse document frequency for that term. The sum of all the resulting numbers may be calculated for all super-categories, and the super-category with the highest sum may be the highest ranked document. The banner ad that was assigned to that highest ranked super-category at the step 72 of the flow chart 52 may then be displayed upon completion of the ranking step 81 of the flow chart 132.

Other techniques for weighting may also be used. For example, if a term is a high frequency term, it may not make much difference in significance whether the term occurs, for example, one thousand times in the search, or whether the term occurs one million times. In order to collapse the significance of such high frequency terms, it may be desirable to use a logarithm or related measure of the term frequency and the inverse document frequency, rather than the raw numbers. Thus, the inverse document frequency may be defined as:

    IDF = log(N × IDF_0) / log(N)

where N is the number of documents in the document set and IDF_0 is the raw inverse document frequency number. Similarly, a statistic can be used to determine the term frequency, TF. A statistic known as Robertson's term frequency for a document is defined as follows:

    RTF = TF / (TF + 0.5 + 1.5 (DL / ADL))

where TF is the raw frequency of a term in a document, DL is the length of the document, and ADL is the average length of a document in the search.
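The ranking at the step 81 can be sketched as a sum, over the query and category terms, of the product of a term frequency statistic and an inverse document frequency statistic. The sketch below is illustrative only and rests on stated assumptions: the function names and the dictionary-of-term-lists data layout are hypothetical, and the raw inverse document frequency is read as 1/df (df being the number of documents that contain the term), so that the normalized IDF becomes log(N/df)/log(N).

```python
import math
from collections import Counter


def robertson_tf(tf, doc_len, avg_doc_len):
    """Robertson's term frequency: TF / (TF + 0.5 + 1.5 * (DL / ADL))."""
    return tf / (tf + 0.5 + 1.5 * (doc_len / avg_doc_len))


def normalized_idf(n_docs, doc_freq):
    """log(N / df) / log(N), assuming the raw IDF is 1/df and N > 1.

    Yields 1.0 for a term found in a single document and 0.0 for a term
    found in every document.
    """
    return math.log(n_docs / doc_freq) / math.log(n_docs)


def rank_super_categories(query_terms, documents):
    """Rank super-categories, best first.

    documents maps each super-category name to its list of terms (its own
    terms plus those of the categories mapped to it). query_terms holds the
    terms from the user query and the matching categories.
    """
    n = len(documents)
    avg_len = sum(len(doc) for doc in documents.values()) / n
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(t for doc in documents.values() for t in set(doc))

    scores = {}
    for name, doc in documents.items():
        counts = Counter(doc)
        scores[name] = sum(
            robertson_tf(counts[t], len(doc), avg_len) * normalized_idf(n, doc_freq[t])
            for t in query_terms
            if counts[t]  # skip terms absent from this document
        )
    return sorted(scores, key=scores.get, reverse=True)


ranking = rank_super_categories(
    ["art", "supplies"],
    {
        "Art": ["art", "supplies", "crafts", "paint"],
        "Dining": ["restaurant", "pizza", "supplies"],
        "Autos": ["car", "repair", "parts"],
    },
)
```

In this example "Art" ranks first because it contains both query terms, and "art" appears in only one document, so it carries the maximum normalized IDF. "Dining" matches only "supplies", which appears in two of the three documents and therefore contributes less.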
These statistics may be further improved by weighting other factors. For example, it is possible to weight each term that appears in one of the categories that is retrieved upon execution of a user query and to normalize the IDF and RTF statistics over the weights. Thus, if a particular category deserves a higher weight, then it might be accorded higher weight in ranking super-categories. For example, a category that is manually mapped to a super-category might be given a higher weight than a category that is automatically mapped. The user query might be given a higher or lower weight than other information. Categories with a large number of listings may be given higher weight. In an embodiment, each category is given a weight corresponding to the number of listings that are associated with the category, normalized by dividing by the total number of listings. In an embodiment, the user query terms are each given a weight of one. In the weighting process, the weight may be multiplied by the term element in performing the sum of the product of term frequency and inverse document frequency over all terms for all documents in the super-category linked list. Thus, with the weights, a normalized version of the Robertson's term frequency statistic can be obtained, permitting improved tuning of search queries beyond what is accomplished with use of the conventional Robertson's term frequency.

Upon completion of the ranking step 81, the highest ranked super-category is selected, and a banner ad that was assigned to that super-category at the step 72 of the flow chart 52 of FIG. 68 is selected. The banner ad may be retrieved, such as via a URL, from the banner ad server 809, for display to the user via the browser 824.

While the invention has been disclosed in connection with the preferred embodiments shown and described in detail, various modifications and improvements thereon will become readily apparent to those skilled in the art. Accordingly, the spirit and scope of the present invention is to be limited only by the following claims.

What is claimed is:

1. A method executed in a computer system for performing redundant data query caching in a computer system comprising:

partitioning a data domain into one or more partitions;

associating one or more of said partitions with one or more nodes in a computer system;

classifying a request for a data query as pertaining to a particular one of said partitions;

routing the request to a node in the computer system in accordance with said particular one of said partitions; and

executing set manipulation techniques on data from a data query cache associated with said node in performing a data query included in said request.

2. The method of claim 1, wherein the computer system is a distributed computer system having at least two nodes.

3. The method of claim 1, wherein the computer system includes two or more processors.

4. The method of claim 1, wherein at least one of the nodes in the computer system includes a multi-processor computer.

5. The method of claim 1, wherein the data domain includes business listings.

6. The method of claim 5, wherein partitioning the data domain includes forming partitions in accordance with a geographic area associated with each of said business listings.

7. The method of claim 1, wherein the data query cache 
is stored on a persistent nonvolatile storage device. 

8. The method of claim 1, wherein routing a request to a 
node in the computer system includes routing the request in 
accordance with static and dynamic information about the 
one or more nodes in the computer system. 

9. An apparatus for performing redundant data query caching in a computer system comprising:

machine executable code for partitioning a data domain into one or more partitions;

machine executable code for associating one or more of said partitions with one or more nodes in the computer system;

machine executable code for classifying a request for a data query as pertaining to a particular one of said partitions;

machine executable code for routing the request to a node in the computer system in accordance with said particular one of said partitions; and

machine executable code for executing set manipulation techniques on data from a data query cache associated with said node in performing a data query included in said request.

10. The apparatus of claim 9, wherein the data domain 
includes business listings. 

11. The apparatus of claim 10, further including: 
machine executable code for forming partitions in accor- 
dance with a geographic area associated with each of 
said business listings. 

12. The apparatus of claim 9, further including: 
machine executable code for routing the request in accor- 
dance with static and dynamic information about the 
one or more nodes in the computer system. 


